<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Squiggle: a Semantic Search Engine for indexing and retrieval of multimedia content</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>CEFRIEL - Politecnico of Milano</institution>
          ,
          <addr-line>Via Fucini 2, 20133 Milano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Search engines are becoming such an easy way to find textual resources that we wish to use them for multimedia content as well; however, syntactic techniques, even if promising, are not up to the task: future search engines must consider new approaches. Experimental prototypes of this search engine of the future are appearing. Most of them employ “smart machines” able to directly process multimedia resources, but we believe that the solution should also embrace “smart data”, able to capture the lexical and conceptual characteristics of a domain in an ontology. To prove that Semantic Web technologies provide real benefits to end users in terms of easier and more effective access to information, we developed Squiggle, a Semantic Web framework that eases the deployment of semantic search engines. Following a model-driven approach, Squiggle makes its models ([…] model and the domain knowledge) part of the running code. We evaluated Squiggle in two real-world deployments: one to search images of skiers for the Torino 2006 Winter Olympic Games and one to search music files.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Finding resources in loosely structured repositories (e.g. the Web, a file server,
even our own mailboxes), to which users can freely add more data, is one of
the key problems of the “digital era”. In most cases, this problem is addressed
by offering a search engine that periodically crawls resources, indexes them and
enables fast searches over those indexes.</p>
      <p>Searching everything everywhere is becoming a habit whenever we need to
find something. We search Web pages in Web search engines, music using search
engines integrated in multimedia players, pictures in image organizer
applications, movies using systems such as Blinkx.tv, and even personal files using desktop
search tools.</p>
      <p>
        However, finding what we need is often a hard job. Current search engine
technology [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is very good at finding complete Web pages published all over
the world, but it lacks the desired precision (the proportion of retrieved data that
is relevant) and recall (the proportion of all available relevant data that is
retrieved) when searching for multimedia resources. For instance, searching for
“jaguar” in an image search engine (e.g., Google, Yahoo, MSN, etc.) results in a
mix of felines and cars, which are difficult to tell apart. Moreover, current
technology is unable to cope with results that require either extracting a part of
a resource (e.g., a scene from a movie) or aggregating numerous resources (e.g.,
relevant but scattered information regarding a person).
      </p>
      <p>Furthermore, searching is an expensive activity. For instance, in a
medium-sized enterprise with 100 employees, each of them would perform around 10
searches per day (some on the Web, some on their mailboxes, etc.), stopping,
successfully or not, after 1-2 minutes. This means that roughly 20-30 hours a day are spent
searching. However, losing time in searching is not the only source of costs,
since getting lost in results and missing relevant information can lead to losing
important opportunities. We may not care about the efficiency loss of an
enterprise, but, as citizens, we care about saving our own time: for example, if
we cannot find a document on an e-government website, we are forced to go
to a counter in person.</p>
      <p>
        What we really need is a search engine able to find any kind of multimedia
resource with the required level of granularity; but how can we achieve this
search engine of the future? We believe that Tim Berners-Lee was right when,
drawing the “Semantic Web Roadmap” [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], he said:
      </p>
      <p>If an engine of the future combines a reasoning engine with a search
engine, it may be able to get the best of both worlds.</p>
      <p>Keeping in mind Tim Berners-Lee’s claim, we conceived, implemented and
deployed Squiggle (Semantically quilted google), an extensible semantic search
framework designed to add a conceptual flavour at indexing time and to exploit
ontological elements as much as possible to improve the searching phase.</p>
      <p>The paper is organized as follows. Section 2 outlines the current efforts to
extend syntactic search by combining smart data (i.e., metadata defined by
ontologies using Semantic Web standards) and smart machines (e.g., image
processing). Section 3 presents the approach we foresee for the “search engine of
the future”. Section 4 shows the conceptual architecture of the Squiggle
framework. Real-world deployments of Squiggle, which demonstrate the
feasibility of our approach, are analyzed in section 5. Finally, concluding remarks are
provided in section 7.</p>
      <p>2 Existing approaches to improve search engines</p>
      <p>
        Syntactic search techniques are widely employed not only on the Web, but also
in many applications we use daily for productivity or pleasure. A standard
“syntactic” search engine implementation is mainly based on three phases (cf.,
among others, [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]):
– crawling time: the phase in which the resources (HTML pages,
multimedia contents, etc.) are collected in order to build a coherent (and more
homogeneous) set;
– indexing time: the phase during which the crawled resources are parsed
and indexed in dedicated data structures; those structures are built
from the relevant information contained in the indexed resources and are
optimized to answer queries quickly (granting search response times in the
order of milliseconds);
– searching time: the run-time phase in which final users submit their
queries in order to retrieve meaningful results; in addition to the optimization
of the indexes, this phase also requires a good method to rank and/or cluster
search results.
      </p>
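      <p>The three phases above can be sketched with a toy in-memory inverted index. This is only an illustration under simplifying assumptions: the “crawler” reads from a hypothetical dictionary of documents, tokenisation is whitespace-based, and ranking simply counts the matching query terms. None of these names come from an actual search engine.</p>
      <preformat>
```python
from collections import defaultdict

# Hypothetical repository: resource identifier -> textual content.
REPOSITORY = {
    "doc1": "jaguar speed in the wild",
    "doc2": "jaguar car dealer",
    "doc3": "wild cats of the world",
}


def crawl(repository):
    """Crawling time: collect the resources into a homogeneous set."""
    return list(repository.items())


def build_index(resources):
    """Indexing time: map each token to the resources that contain it."""
    index = defaultdict(set)
    for resource_id, text in resources:
        for token in text.lower().split():
            index[token].add(resource_id)
    return index


def search(index, query):
    """Searching time: rank resources by the number of matching query terms."""
    scores = defaultdict(int)
    for token in query.lower().split():
        for resource_id in index.get(token, ()):
            scores[resource_id] += 1
    # Highest score first; ties are broken by identifier for determinism.
    return sorted(scores, key=lambda rid: (-scores[rid], rid))


index = build_index(crawl(REPOSITORY))
print(search(index, "wild jaguar"))
```
      </preformat>
      <p>Note that a query for “jaguar” alone returns both the animal and the car document: the index knows tokens, not meanings, which is exactly the limitation discussed in the rest of this section.</p>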
      <p>
        The reason why Web search engines give such good results, in terms of
performance and scalability, lies in the very form of the Web, made up of
semi-structured text and links. At indexing time, Web pages, being semi-structured
text, can be subjected to a wide range of well-known techniques for parsing tags,
tokenising the contents of tags, and building indexes. Moreover, Web resources,
being interlinked, enable a straightforward implementation of spiders, which can
traverse the entire Web at crawling time, and the application of effective ranking
algorithms (e.g., PageRank [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]) useful at search time.
      </p>
      <p>When the resources to be indexed are multimedia files instead of Web pages,
automatically processing their content becomes very difficult, and the lack of
links makes crawling a tricky problem and the PageRank algorithm useless. The two
ways out are the use of smart machines and smart data.</p>
      <p>By smart machines we mean a set of techniques that includes text
processing (e.g., natural language processing), audio processing (e.g., acoustic
fingerprints, automatic speech recognition, etc.) and image/video processing (e.g.,
computer vision, scene change detection, segment detection, etc.). Several search
engines that exploit smart machines are appearing. For instance, Retrievr
(http://labs.systemone.at/retrievr) finds images from rough sketches drawn by the
user; Musipedia (http://www.musipedia.org/) offers an interface for
querying by humming; ANSES (Automatic News Summarization and
Extraction System, http://www.doc.ic.ac.uk/~mjp3/anses/) captures TV news, extracts
key scenes from the video, analyzes the audio to extract references to key persons,
places and date-times, and enables searches on the repository.</p>
      <p>
        On the other side, smart data is the basis for search engines that exploit
semantics at search time to increase both recall and precision with respect to
purely syntactic engines, which do not exploit the possibility, offered by
Semantic Web standards, of modeling the domain both at the lexical level (i.e., SKOS
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]) and at the knowledge level (i.e., OWL [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]). Explicit representation of semantics
(both lexicon and knowledge) gives search engines the ability to disambiguate
between homonyms and to expand the search to synonyms and pseudonyms.
Moreover, smart data gives search engines the possibility of augmenting the set of
relevant results by exploiting hyperonymy and hyponymy at the lexical level,
conceptual broadening and narrowing at the knowledge level, or any other relationship
that can be described by a domain ontology (e.g., “The Three Tenors” is the
brand used by Pavarotti, Domingo and Carreras when singing together).
      </p>
    </sec>
    <sec id="sec-2">
      <title>-</title>
      <p>
        However, at search time all these features are available only if resources
are augmented with semantic annotations, which do not come for free. The most
obvious way to semantically annotate resources is to do it manually (for instance
using Annotea [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] or SMORE [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]). Clearly, such a manual process is affordable only
in specific domains in which either cultural reasons (e.g., librarians have
always annotated and cataloged books) or collaborative behaviors
(e.g., Wikipedia) make the annotation process sustainable. In all other cases,
some (semi)automatic annotation mechanism is needed. To this end, combining
smart data with smart machines seems to be the right approach.
      </p>
      <p>
        There are several examples of existing approaches that try to combine
Semantic Web technologies with smart machines in search engines. One of the
most interesting is represented by KIM [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. KIM starts from the idea that a
practical semantic annotation is impossible without some particular knowledge
modeling commitments. Therefore, KIM introduces a light-weight upper-level
ontology, which encodes many of the domain-independent commonsense concepts
and allows domain-specific extensions. Moreover, KIM includes a semantically
enhanced information extraction system, which provides automatic semantic
annotation with references to classes and instances in the ontology. Finally, on the
basis of these semantic annotations, KIM performs semantic-based indexing and
retrieval, where users can mix traditional queries and ontology-based ones.
      </p>
      <p>
        Some other interesting approaches in introducing semantics in indexing and
searching are:
– TAP [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] Semantic Search, which has two goals: (1) augmenting traditional
search results with data pulled from the Semantic Web; (2) using an
understanding of the “denotation” of the search term to improve traditional
search, where denotation is the problem of identifying the concepts in the
search query;
– ALVIS [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] open-source semantic-based peer-to-peer search engine, which,
before indexing the documents, enriches them with some “semantic
awareness” of the specific subject, by using information extraction technology;
– ConWeaver [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] semantic search engine, which uses a knowledge network
(i.e., an ontology) as a rich knowledge-based search index, thus enabling
it to return search results for queries in multilingual, heterogeneous
data environments.
      </p>
      <p>3 Our steps towards the “search engine of the future”</p>
      <p>In our opinion, what Tim Berners-Lee calls the “search engine of the future”
should have a structure similar to existing “syntactic” search engines, but should
also be enriched with machine-processable semantics. In our vision, domain
ontologies can be employed to empower searching, indexing and also crawling.</p>
      <p>At crawling time, prior knowledge about the domain can assist the
collection of resources to be indexed, because this “know-how” can drive the
crawler to focus on relevant information even if links are not explicit. For
example, awareness of specific domain knowledge (e.g., that the “Beatles” used
to be called the “fab four” and that their name is frequently misspelled “Beetles”)
eases the identification of the documents containing relevant
information (e.g., music of the Beatles within a folder full of mp3 files that normally
include in the filename an indication of the performer, in its multiple spellings).</p>
      <p>At indexing time, “a little semantics” can be introduced: the input
information can be analyzed by means of smart machines and tagged with respect to
its meaning before it is processed by the indexer tool. In this way, the tool is
able to index both the syntactic content of an input document and its attached
semantics (i.e., the unique identifiers of the concepts that tag the document).
In brief, the indexing process can be driven by semantics, becoming an effective
conceptual indexing process.</p>
      <p>At searching time, domain ontologies can be employed to customize search
engine applications and to improve the user experience in terms of added value
and effectiveness of the search. If aware of specific concepts and properties, the
tool can help users refine a query both by clarifying the subject of the
search (“Did you mean. . . ?”) and by suggesting possible expansions of the query
to related subjects (“You could also be interested in. . . ”). The result is that
users can more easily find what they were looking for.</p>
      <p>We conceived Squiggle keeping in mind Tim Berners-Lee’s claim and the above
analysis. In particular, Squiggle is an extensible framework (see §4) designed to
add a conceptual layer to the indexing process and to exploit ontological elements
as much as possible to improve the searching phase, leaving to each domain-dependent
instantiation of the framework the choice of using ontologies also at crawling
time (see case studies in §5).</p>
      <p>4 Conceptual architecture of the Squiggle framework</p>
      <p>[…]
– a semantic search-engine, i.e. a semantic web application with searching
functionalities; and
– a semantic-search engine, i.e. a search engine that is able to deal with the
“meaning” of the searched information.</p>
      <p>
        Squiggle is designed to provide both syntactic and semantic indexing and
searching primitives, seamlessly combining the speed of syntactic search tools
with improved recall and precision, based on the ability to assign alternative
designations and wordings in multiple languages to their meaning. Among the
constituents of Squiggle, Sesame (http://openrdf.org/) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is used to store and query the semantic
information constituting the knowledge base, described in RDF/OWL with regard
to the SKOS model, whereas the syntactic search engine Lucene (http://lucene.apache.org/) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] is used,
among other things, to quickly perform text searches in literals, which is
something that semantic search tools typically cannot do well. Therefore the described
architecture lends itself well both to overcoming the limitations of purely syntactic
approaches and to improving the performance of semantic engines.
      </p>
      <p>[…] search applications. In essence, Squiggle is not a search engine itself, but it
allows users to customize their own engine on the basis of a particular domain
knowledge, as will be explained in section 5.</p>
      <p>The main components of Squiggle are sketched in figure 1. In the following, we
provide a brief description of Squiggle’s Conceptual Indexing (§4.1) and Semantic
Search (§4.2) capabilities. Moreover, in §4.3, we illustrate how to add plug-in
extensions to Squiggle to customize and strengthen its potentialities.</p>
      <p>[Figure 1: main components of Squiggle. Labels: data to be indexed, data crawler, Conceptual Indexer, Domain Knowledge (DK), Indexed DK (IDK), Record Index, IDK Label Index, Semantic Searcher, search engine application, web pages. Legend: Sesame, Lucene.]</p>
      <p>4.1 Conceptual Indexing</p>
      <p>Indexing can be defined as a complex process that transforms the input
information into a format that allows an efficient and easy search. Conceptual Indexing
is an indexing process that, in addition, tries to semantically annotate contents
with the corresponding “concepts”.</p>
      <p>Squiggle’s semantic annotation process is very minimalist: it expects resources
to be already annotated with keywords and it searches the Domain Knowledge
repository for concepts whose SKOS labels match those keywords. If a single
concept is retrieved, then the resource is annotated with it. Otherwise, we use
heuristic approaches to identify the closest concept(s). If these heuristics also
fail, some manual annotation is required. In any case, the identified
concepts are copied from the Domain Knowledge and inserted into the Indexed
Domain Knowledge for efficiency reasons: the IDK is the minimum useful subset of the
Domain Knowledge repository, and it is created to speed up searching when the DK
contains millions of triples, the majority of which are not used at run-time.</p>
      <p>The default implementation of Conceptual Indexing is straightforward: first
the input information is scanned and analyzed in order to identify and extract
the concepts that characterize it (the semantic annotation process described
above); then, these concepts are stored in the index for subsequent search and
retrieval (indexing process). The index consists of three parts:
1. the main part is a Lucene index (cf. Record Index in fig. 1) that indexes all
the textual parts and the URIs of the identified concepts, which are helpful
in the syntactic searching phase;
2. the “semantic” part is a Sesame repository (cf. Indexed Domain Knowledge
or IDK in fig. 1) that contains all the triples that the search engine will need
at runtime, when answering final users’ queries;
3. another, smaller Lucene index (cf. IDK Label Index in fig. 1), with the role of
supporting the Sesame repository, that indexes all the labels contained in the IDK
in order to speed up their retrieval at search time (being very fast in retrieving
data in response to any syntactic search, Lucene can help in indexing all the
literals contained in the “semantic” domain knowledge).</p>
    </sec>
    <sec id="sec-3">
      <title>4.2 Semantic Search</title>
      <p>Squiggle can be used to perform searches both at the syntactic and at the
semantic level. The employment of Lucene in a “traditional” way (i.e., to index
text) provides the classical syntactic search. Moreover, the previous Conceptual
Indexing phase also enables semantic searches.</p>
      <p>Every time a user submits a query, Squiggle’s Semantic Searcher analyses it
and tries to identify the ontological elements, contained in the Indexed Domain
Knowledge repository, that can be related to the request. Then, it is able to
“suggest” to the user the potential meanings of the query that it recognized; the
user is therefore presented with both the results of the syntactic search and the
available meanings extracted from the query, which can help to refine the
request by “disambiguating” among its possible senses.</p>
      <p>The MeaningSuggester is the abstraction we use to identify the “tool”
able to find ontological elements that have some semantic relationship with the
search. A MeaningSuggester accesses the Indexed Domain Knowledge
repository in order to extract concepts or instances thereof (from now on called
meanings) that answer the user query. Once a specific meaning is identified (i.e., the
user disambiguates by selecting one of the suggestions), a semantic search
consists in looking up the indexed resources whose semantic annotation corresponds
to that meaning.</p>
      <p>But there is much more: when a user query is traced back to a specific
meaning, Squiggle’s Semantic Searcher is able not only to look up resources
semantically related to that meaning, but also to seek other concepts that could
be of interest to the user. This is possible because Squiggle’s Semantic Search
can navigate across the graph of interconnected elements of the domain ontology,
following “semantic paths” denoted by relations and attributes.</p>
      <p>A MeaningSuggester is therefore a common pattern that can have
multiple behaviors (i.e., different execution semantics), since it can find meanings
according to different criteria. For example, the MeaningSuggester may
return the meanings that match a given query after expansion of the query terms
to their aliases (playing on the existence of a skos:prefLabel and multiple
skos:altLabels for each concept); furthermore, the MeaningSuggester may
return meanings that are related to the query terms by conceptual narrowing
(skos:narrower) or broadening (skos:broader) in the domain knowledge, and
so on.</p>
      <p>The Squiggle framework comes with a set of ready-to-use MeaningSuggester
behaviors, since it is based on the SKOS model and can therefore exploit
SKOS properties (besides the previously cited ones, the MeaningSuggester
also knows relations like skos:subject, skos:related, skos:relatedPartOf,
etc.). In table 1, we give more details on the process Squiggle uses to “suggest”
meanings.</p>
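      <p>The MeaningSuggester pattern described above can be illustrated with a minimal sketch. It is not Squiggle’s implementation: the Indexed Domain Knowledge and the annotations are plain Python dictionaries instead of Sesame and Lucene stores, and all class names other than MeaningSuggester, as well as every URI and label, are hypothetical.</p>
      <preformat>
```python
# Toy Indexed Domain Knowledge: concept URI -> SKOS-like properties.
# All URIs and labels below are made up for illustration.
IDK = {
    "ex:Feline": {"prefLabel": "feline", "altLabels": ["cat family"], "broader": []},
    "ex:JaguarAnimal": {"prefLabel": "jaguar", "altLabels": ["panthera onca"],
                        "broader": ["ex:Feline"]},
    "ex:JaguarCar": {"prefLabel": "jaguar", "altLabels": ["jag"], "broader": []},
}

# Toy record index: resource -> concepts it was annotated with.
ANNOTATIONS = {
    "img1.jpg": {"ex:JaguarAnimal"},
    "img2.jpg": {"ex:JaguarCar"},
    "img3.jpg": {"ex:Feline"},
}


class LabelMeaningSuggester:
    """Suggests concepts whose preferred or alternative label matches a term."""

    def suggest(self, term):
        term = term.lower()
        return sorted(
            uri for uri, c in IDK.items()
            if term == c["prefLabel"] or term in c["altLabels"]
        )


class BroaderMeaningSuggester:
    """Suggests the concepts reachable via the 'broader' relation."""

    def suggest(self, concept_uri):
        return sorted(IDK[concept_uri]["broader"])


class NarrowerMeaningSuggester:
    """Suggests the concepts that declare the given one as 'broader'."""

    def suggest(self, concept_uri):
        return sorted(uri for uri, c in IDK.items() if concept_uri in c["broader"])


def semantic_search(concept_uri):
    """Returns the resources annotated with the selected meaning."""
    return sorted(r for r, concepts in ANNOTATIONS.items() if concept_uri in concepts)


print(LabelMeaningSuggester().suggest("jaguar"))
print(semantic_search("ex:JaguarAnimal"))
```
      </preformat>
      <p>In this sketch the homonym “jaguar” yields two candidate meanings; once the user picks one, the semantic search returns only the resources annotated with that concept, and the broader/narrower behaviours offer related concepts to explore.</p>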
      <p>4.3 Building extensions for Squiggle</p>
      <p>As described above, Squiggle is designed by following a model-driven approach
that uses SKOS as its model. Therefore both the Conceptual Indexing and the
Semantic Search do not rely on concepts or properties defined in the domain
ontology, but on SKOS.</p>
      <p>However, to build a more powerful semantic search engine, knowledge about
the domain can be inserted into the system, in order to exploit Squiggle’s
capabilities in indexing and searching. The flexibility of the Squiggle framework is
based on its ability to adapt to any real case: whoever needs to build a search
engine in a particular domain can easily employ Squiggle and extend it by
creating ad hoc plug-ins, gaining not only in recall and precision during searches, but
also in more reliable indexing. Therefore, semantic annotation and meaning
suggestion plug-ins should be added as part of the deployment.</p>
      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th></th><th>Representation</th><th>Execution Semantics of MeaningSuggester</th></tr>
          </thead>
          <tbody>
            <tr><td>homonymy</td><td>same SKOS label for different OWL classes</td><td>given a term, it returns the SKOS prefLabel of each OWL class that has the term as SKOS label</td></tr>
            <tr><td>pseudonymy</td><td>SKOS altLabels of the same OWL class</td><td>given a SKOS altLabel of an OWL class, it returns all its SKOS labels</td></tr>
            <tr><td rowspan="2">synonymy</td><td>different SKOS labels of the same SKOS concept</td><td>given a SKOS concept, it returns all the SKOS labels of the same SKOS concept</td></tr>
            <tr><td>SKOS labels of OWL classes related by equivalence property</td><td>given an OWL class, it returns all the SKOS labels of the equivalent OWL classes</td></tr>
            <tr><td rowspan="2">hyperonymy</td><td>SKOS concepts related by SKOS broader relationship</td><td>given a SKOS concept, it returns the SKOS concepts related via SKOS broader property</td></tr>
            <tr><td>entailment among OWL classes</td><td>given an OWL class, it returns its super-classes</td></tr>
            <tr><td rowspan="2">hyponymy</td><td>SKOS concepts related by SKOS narrower relationship</td><td>given a SKOS concept, it returns the SKOS concepts related via SKOS narrower property</td></tr>
            <tr><td>entailment among OWL classes</td><td>given an OWL class, it returns its subclasses</td></tr>
            <tr><td>meronymy</td><td>SKOS concepts related by SKOS relatedPartOf property</td><td>given a SKOS concept, it returns the SKOS concepts related via SKOS relatedPartOf property</td></tr>
            <tr><td>generic semantic relationship</td><td>SKOS concepts related by SKOS related property</td><td>given a SKOS concept, it returns the SKOS concepts related via SKOS related property</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>As we reported in section 4.1, Squiggle has a minimalist semantic annotation
plug-in that is sufficient to index resources for which semi-structured annotations
are already available (e.g., ID3 metadata in mp3 files). For all other cases, we
provide a way of integrating custom semantic annotation techniques as
plug-ins. For instance, acoustic fingerprints (e.g. MusicIP’s PUIDs, see
http://wiki.musicbrainz.org/HowPUIDsWork) identify audio
files on the basis of the contained audio data, with good tolerance to encoding,
resampling and noise. Combining such a smart machine technique with smart data
(e.g. MusicBrainz), deriving author and title is almost automatic (cf. Squiggle
Music deployment in §5.2).</p>
      <p>As we reported in section 4.2, the predefined MeaningSuggester, provided
by the Squiggle framework, reflects the most common needs, since it exploits
general semantic relations. On the other hand, when building a search application
within a specific domain, managing other relations is useful too. This aim is
easily achieved by defining one or more domain-aware MeaningSuggester
behaviours: each one of them will follow a specific semantic “path” connecting
resources. For example, in the Squiggle Music application (see below §5.2), we
created a SongArtistMeaningSuggester that finds the songs connected through the
domain property music:performedBy to a given artist.</p>
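      <p>A domain-aware behaviour of this kind might look as follows. This is a hedged sketch, not the actual plug-in API: only the names SongArtistMeaningSuggester and music:performedBy come from the text, while the base classes, the registration list, the in-memory triple list and all URIs are assumptions made for illustration.</p>
      <preformat>
```python
# Toy triple store: (subject, predicate, object). The URIs are invented;
# only the property name music:performedBy comes from the text.
TRIPLES = [
    ("ex:Californication", "music:performedBy", "ex:RedHotChiliPeppers"),
    ("ex:Otherside", "music:performedBy", "ex:RedHotChiliPeppers"),
    ("ex:Yesterday", "music:performedBy", "ex:TheBeatles"),
    ("ex:RedHotChiliPeppers", "skos:altLabel", "rhcp"),
]


class MeaningSuggester:
    """Hypothetical plug-in interface: maps a meaning to related meanings."""

    def suggest(self, meaning):
        raise NotImplementedError


class PropertyPathMeaningSuggester(MeaningSuggester):
    """Follows one property backwards: finds subjects pointing at the meaning."""

    predicate = None  # set by subclasses

    def suggest(self, meaning):
        return sorted(s for s, p, o in TRIPLES if p == self.predicate and o == meaning)


class SongArtistMeaningSuggester(PropertyPathMeaningSuggester):
    """Given an artist, suggests the songs linked via music:performedBy."""

    predicate = "music:performedBy"


SUGGESTERS = [SongArtistMeaningSuggester()]  # registered as part of a deployment


def suggest_all(meaning):
    """Collects the suggestions of every registered plug-in."""
    found = []
    for suggester in SUGGESTERS:
        found.extend(suggester.suggest(meaning))
    return found


print(suggest_all("ex:RedHotChiliPeppers"))
```
      </preformat>
      <p>Keeping the property name as the only thing a subclass has to declare reflects the idea that each behaviour follows one semantic “path”; other paths would be added as further subclasses rather than by changing the framework code.</p>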
      <p>5 Squiggle real-world deployments</p>
      <p>To build a domain-specific search engine, the first indispensable ingredient
is, of course, the domain knowledge, which is essential to the indexing process,
because it is the basis on which concepts can be identified within the input
information. We assume that this knowledge is systematized and formalized
in a machine-processable structure; in particular, in RDF/OWL with regard
to the SKOS model. In our architecture, the formalized knowledge is stored in a
Sesame repository called Domain Knowledge (see figure 1). This repository is
not a proper part of Squiggle, but it is evident that no application can be built
without it. Once the domain knowledge is formalized, Squiggle can be further
configured with domain-specific extensions, as explained in §4.3.</p>
      <p>As regards searching time, we chose to put user
experience at the center of our investigation: we designed the search interface to
be as simple as possible and the query system to be as intuitive as possible for the
user. Therefore, we chose a very minimalistic query form that accepts
keyword-based requests, like the most common search engines we are used to,
hiding behind this simple interface the processing needed to analyze the query and to
retrieve results of interest. Furthermore, the additional supporting “suggestions”
the user receives in response to a query are displayed in clear and non-invasive
side-boxes.</p>
      <p>The aim of this simplification of the user interface is to provide fresh and
innovative features without overloading their presentation, while maintaining an
immediate and clean visual experience.</p>
      <p>12 See also http://www.cefriel.it/press/olimpiadi2006.html?lang=en</p>
      <p>These examples simply highlight that implicit language assumptions, which
hold in searching for Web pages, must be reconsidered when searching for images.</p>
      <p>In other words, searching for Web pages using a given language has, in general,
a strict implication on the language of the page, whereas searching for images
does not. A semantic search engine can handle this implicit knowledge by means
of an ontology that represents the domain, both conceptually (including OWL
classes for athletes, resorts, disciplines and the respective instances) and lexically
(including multiple SKOS labels in multiple languages for each concept).</p>
      <p>In order to instantiate Squiggle Ski, we built a domain ontology partially
by hand, developing a small multilingual taxonomy of the disciplines in the
sectors of alpine skiing (the discipline concepts have labels in English, Italian,
German, French, Swedish, Norwegian and Finnish), and partially by collecting
information from the FIS-Ski web site (http://www.fis-ski.com/), from which we
collected the names of all the athletes that reached a podium, as well as all the
events of the last three years and their relationships with the nations that hosted
them, the top three athletes of each event and the type of discipline.</p>
      <p>Then we built an experimental focused crawler that exploits the knowledge
in the ski ontology to collect images of skiers from sport news Web sites all over
the world. The awareness of all relevant terms (names of athletes, disciplines,
places, etc., with possible alternative labels in different languages) helps both the
focused crawler to filter the appropriate photos and the Conceptual Indexer to
semantically annotate them before the indexing process. The crawler was tested
for a couple of months before Torino 2006 and it ran for the entire Torino
2006 event. It collected more than 1800 images.</p>
      <p>Squiggle Ski is on-line at http://squiggle.cefriel.it/ski; during the Torino
2006 event, it was visited by almost one thousand visitors searching for the
various athletes that won a medal in the alpine skiing races. When you open the
home page, you are presented with an ordinary search box. If you try searching
for “Herminator abfahrt” (“Herminator” being a nickname for Hermann Maier
and “abfahrt” the German for downhill), you receive plain syntactic search results, and
in a box on the right Squiggle Ski asks if you mean the athlete “Hermann Maier”
and the discipline “downhill”. If you then follow Squiggle Ski’s suggestions,
all the images of Hermann Maier in a downhill race contained in the repository
are retrieved, regardless of the language used in the initial query; an explanation
box shows how Squiggle Ski expanded the query to achieve the result (see figure
2 for an example).</p>
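      <p>The “Herminator abfahrt” walk-through can be approximated by the following sketch. It assumes a flat label table and single-token matching, which is far simpler than a SKOS-based ontology; “Herminator” and “abfahrt” come from the text, whereas the concept URIs, the French label and the image file names are invented for the example.</p>
      <preformat>
```python
# Lexical layer: lower-cased label -> concept. "Herminator" and "abfahrt"
# come from the text; the other labels and all URIs are illustrative.
LABELS = {
    "herminator": "ski:HermannMaier",
    "downhill": "ski:Downhill",   # English
    "abfahrt": "ski:Downhill",    # German
    "descente": "ski:Downhill",   # French
}

# Conceptually indexed images: file name -> annotating concepts (invented).
IMAGES = {
    "maier_downhill.jpg": {"ski:HermannMaier", "ski:Downhill"},
    "maier_podium.jpg": {"ski:HermannMaier"},
    "other_downhill.jpg": {"ski:Downhill"},
}


def suggest_meanings(query):
    """Maps each query term that matches a known label to its concept."""
    meanings = []
    for term in query.lower().split():
        concept = LABELS.get(term)
        if concept is not None and concept not in meanings:
            meanings.append(concept)
    return meanings


def search_by_meanings(meanings):
    """Returns the images annotated with all of the selected concepts."""
    wanted = set(meanings)
    if not wanted:
        return []  # nothing was disambiguated, so there is no semantic result
    return sorted(name for name, concepts in IMAGES.items() if wanted.issubset(concepts))


meanings = suggest_meanings("Herminator abfahrt")
print(meanings)
print(search_by_meanings(meanings))
```
      </preformat>
      <p>Because the images are indexed by concept rather than by word, the query “Herminator descente” would select the same image in this sketch: the language of the query only matters at the lexical layer.</p>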
      <p>Squiggle Music is an instantiation of the Squiggle framework in the music field. We
noticed that both widely used media players and popular sites for buying music
fail to retrieve tracks when alternative wordings or translations are used (e.g.
searching “rhcp” does not always retrieve the list of all Red Hot Chili Peppers
tracks in the repository).</p>
      <p>Squiggle Music indexes audio files (mainly mp3 files), enriching them with
information about authors, song titles and music genres. Squiggle Music is publicly
available at http://squiggle.cefriel.it/music.</p>
      <p>The sources of these data are two freely available meta-databases developed
and maintained by web communities: MusicMoz15 and MusicBrainz16. The latter
contains very comprehensive information concerning names of music bands and
titles of tracks as well as associations between different bands/artists. The
former is a smaller set of XML documents that, however, also includes a taxonomy
of musical styles and associations between bands/artists and styles. To build the
Domain Knowledge repository we merged the data contained in these
repositories, creating a SKOS-based RDF/OWL ontology; the resulting knowledge base
contains more than two million triples.</p>
      <p>In order to exploit this large knowledge base while indexing, we made use of an
audio recognition technique: for each
song, MusicBrainz offers its TRM id17. TRM is an audio fingerprinting technology that
generates a unique fingerprint for an audio file based on an analysis of the acoustic
properties of the audio itself. This “barcode” is independent of the particular
file format and can be extracted from almost any audio file; each unique
fingerprint can therefore be compared with existing databases of fingerprints in order
to identify the specific musical track precisely. Therefore, using tools like
QuickNamer18, a small stand-alone application that is able to calculate the TRM of
a file, and searching the MusicBrainz database for a match, it is possible to relate
an mp3 file to the song’s metadata19.</p>
      <p>15 http://www.musicmoz.org/
16 http://www.musicbrainz.org/
17 http://www.relatable.com/tech/trm.html</p>
      <p>Combining the smart data from MusicBrainz and MusicMoz with a smart
machine like QuickNamer, we built an automatic semantic annotator that acts
as a domain-dependent plug-in of the Squiggle framework during the Conceptual
Indexing phase. This annotator is therefore able to add to each file all its metadata
(artist, song title, etc.).</p>
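      <p>The annotation flow (compute a fingerprint, look it up, attach the metadata) might be sketched as follows. This is only an illustration under strong assumptions: the fingerprint function is a stand-in that hashes raw bytes, so unlike a real TRM id it is not independent of the file format, and the FingerprintAnnotator class and its lookup table are hypothetical names rather than part of Squiggle, QuickNamer or MusicBrainz.</p>
      <preformat>
```python
import hashlib


def fingerprint(audio_bytes):
    """Stand-in for an acoustic fingerprint (assumption, not a TRM id).

    A real fingerprint is computed from the decoded audio signal; hashing the
    raw bytes is used here only so that the sketch runs without audio libraries.
    """
    return hashlib.sha1(audio_bytes).hexdigest()


class FingerprintAnnotator:
    """Attach metadata to a file by looking up its fingerprint in a table."""

    def __init__(self, known_tracks):
        # known_tracks maps a fingerprint to a metadata dict (artist, title, ...).
        self.known_tracks = known_tracks

    def annotate(self, audio_bytes):
        """Return a copy of the matching metadata, or None for an unknown track."""
        metadata = self.known_tracks.get(fingerprint(audio_bytes))
        if metadata is None:
            return None
        return dict(metadata)


if __name__ == "__main__":
    sample = b"placeholder audio payload"
    table = {fingerprint(sample): {"artist": "Red Hot Chili Peppers",
                                   "title": "hypothetical track title"}}
    annotator = FingerprintAnnotator(table)
    print(annotator.annotate(sample))
    print(annotator.annotate(b"some other payload"))
```
      </preformat>
      <p>Keeping the fingerprint function separate from the lookup means that a different fingerprinting scheme could be substituted without touching the annotator itself.</p>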
      <p>From the end user’s point of view, the search facilities offer many interesting
options. Besides the usual “suggestion” of meanings through the analysis of the user’s
query, Squiggle Music is able to perform query expansion and to present the
user with other results that could be of interest. This is possible because the
Semantic Search of Squiggle Music builds on different (default
and ad hoc) MeaningSuggester behaviours: Squiggle Music can suggest
related artists when the user searches for a performer, songs by the same artist
when looking for a song, and broader and narrower styles when asking for a music
genre.</p>
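      <p>One way to picture these suggestion behaviours is as small functions registered per concept type. The sketch below is an assumption-laden illustration, not the MeaningSuggester API: the KNOWLEDGE dictionary, the SUGGESTERS registry and all function names are hypothetical, and the entries reuse the “Celtic” and “Queen” cases discussed in this section.</p>
      <preformat>
```python
# Hypothetical fragment of a SKOS-like knowledge base (illustrative entries only).
KNOWLEDGE = {
    "Celtic": {"type": "style", "broader": ["World"],
               "narrower": ["Celtic Pop"], "related": []},
    "Queen": {"type": "artist", "broader": [], "narrower": [],
              "related": ["Freddie Mercury"]},
}


def suggest_broader_narrower(entry):
    """Propose wider and more specific styles for a music genre."""
    wider = [("broader", label) for label in entry["broader"]]
    finer = [("narrower", label) for label in entry["narrower"]]
    return wider + finer


def suggest_related(entry):
    """Propose related artists for a performer."""
    return [("related", label) for label in entry["related"]]


# Which suggesters apply to which kind of concept (an assumed configuration).
SUGGESTERS = {
    "style": [suggest_broader_narrower],
    "artist": [suggest_related],
}


def suggest(concept, kb=KNOWLEDGE):
    """Run every suggester registered for the concept type and merge the results."""
    entry = kb.get(concept)
    if entry is None:
        return []
    results = []
    for suggester in SUGGESTERS.get(entry["type"], []):
        results.extend(suggester(entry))
    return results


if __name__ == "__main__":
    print(suggest("Celtic"))
    print(suggest("Queen"))
```
      </preformat>
      <p>With this layout, adding an ad hoc behaviour for a new domain amounts to writing one more function and registering it for the relevant concept type.</p>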
      <p>For example, if the user query is “rhcp”, the system suggests “Red Hot
Chili Peppers” as one of the possible meanings (“rhcp” being a known acronym);
selecting the proposed meaning results in better recall, since only a few songs are
likely to match the search string syntactically. A query for the band “Queen”
will propose, e.g., related artists such as Freddie Mercury (who was part of the
band). Finally, when asked for the “Celtic” genre, Squiggle Music can propose to widen
the query by suggesting “World” music, a broader meaning, or to narrow it with the “Celtic Pop”
style, a narrower meaning.</p>
      <p>6 Future works</p>
      <p>Finally, in order to better highlight the added value brought by Squiggle, we
plan to use a formal evaluation framework to assess our solution. We aim at
setting up a scheme to take the necessary measurements of precision/recall and
to compare Squiggle’s behavior against other search engines. The main obstacle
we meet in designing this evaluation framework is the difficulty of comparing
Squiggle with other search engines, because of the lack of “homogeneous”
systems, since Squiggle – or rather each instance of the Squiggle framework – is
strictly domain-dependent and cannot be easily compared to general-purpose
search engines.</p>
      <p>In addition, we are planning some user trials to measure the usability of our
solution; our (subjective) opinion is that Squiggle is a user-friendly tool: the user
interacts with Squiggle exactly as he/she does with any other search engine, but
he/she can have a better experience because of the suggestions of meanings.</p>
      <p>18 http://phonascus.sourceforge.net/
19 An alternative to TRM ids is today represented by PUIDs (see http://www.musicdns.org/); MusicBrainz is currently including also this kind of acoustic
fingerprint in its repository.
20 http://seip.cefriel.it</p>
      <p>7 Conclusions</p>
      <p>Moreover, we designed Squiggle keeping in mind the particular needs of
searching when dealing with multimedia contents. In particular, we forecast a
larger adoption of smart machines to process media, in order to promote a
better generation and aggregation of smart data while exploiting media-dependent
capabilities. The extensible nature of Squiggle fully enables the joint
employment of smart machines and domain ontologies. It is worth noting that one of
the major drawbacks of a semantic approach is its dependence on the availability of domain ontologies
that formalize specific knowledge in a structured way. In this regard, we
strongly wish for a large-scale adoption of the SKOS model for the development of
ontologies, because we believe that SKOS represents a good trade-off between
the expressiveness of an ontological language and the simplicity of organization
systems (like categorizations and taxonomies).</p>
      <p>Finally, we admit that a semantic search engine developed with Squiggle
is strongly domain-dependent and cannot compete with general-purpose search
engines; however, we definitely believe that such an approach provides better
results, especially when dealing with multimedia contents, because a focused
tool better meets specialized needs, helping you to find what you’re really
looking for.</p>
      <p>Acknowledgments</p>
      <p>The research has been partially supported by the NeP4B project, funded by
the Italian Ministry of University and Research (MIUR project, FIRB-2005). We
would like to thank our former colleagues Davide Martinenghi and Francesco
Dolcini for their help and support in designing and developing Squiggle.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Sergey</given-names>
            <surname>Brin</surname>
          </string-name>
          , Lawrence Page:
          <article-title>The Anatomy of a Large-Scale Hypertextual Web Search Engine</article-title>
          .
          <source>Computer Networks and ISDN Systems</source>
          <volume>30</volume>
          (
          <issue>1-7</issue>
          ) (
          <year>1998</year>
          )
          <fpage>107</fpage>
          -
          <lpage>117</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Tim</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          :
          <article-title>Semantic Web Road map</article-title>
          . Available on the web at http://www.w3.org/DesignIssues/Semantic.html (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Page</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Motwani</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Winograd</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>The pagerank citation ranking: Bringing order to the web</article-title>
          .
          <source>Technical report, Stanford Digital Library Technologies Project</source>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Miles</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brickley</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>SKOS Core Guide</article-title>
          , W3C Working Draft. http://www.w3.org/TR/swbp-skos-core-guide (2 November 2005)
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Patel-Schneider</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hayes</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horrocks</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>OWL Web Ontology Language</article-title>
          , W3C Recommendation. http://www.w3.org/TR/owl-semantics/ (10 February
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>José</given-names>
            <surname>Kahan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Marja-Riitta</given-names>
            <surname>Koivunen</surname>
          </string-name>
          , E.P.
          ,
          <string-name>
            <surname>Swick</surname>
            ,
            <given-names>R.R.</given-names>
          </string-name>
          :
          <article-title>Annotea: An Open RDF Infrastructure for Shared Web Annotations</article-title>
          .
          <source>In: Proceedings of the WWW10 International Conference</source>
          , Hong Kong. (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kalyanpur</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parsia</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hendler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Golbeck</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>SMORE - Semantic Markup, Ontology, and RDF Editor</article-title>
          . (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Atanas</given-names>
            <surname>Kiryakov</surname>
          </string-name>
          , Borislav Popov, Ivan Terziev, Dimitar Manov, Damyan Ognyanoff:
          <article-title>Semantic Annotation, Indexing, and Retrieval</article-title>
          .
          <source>Elsevier's Journal of Web Semantics</source>
          <volume>2</volume>
          (
          <issue>1</issue>
          ) (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>R.V.</given-names>
            <surname>Guha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Rob</given-names>
            <surname>McCool</surname>
          </string-name>
          :
          <article-title>TAP: A Semantic Web Platform</article-title>
          .
          <source>in proceedings of the Eleventh International World Wide Web Conference (WWW2002)</source>
          , May 7-11, Honolulu, Hawaii, USA (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Wray</given-names>
            <surname>Buntine</surname>
          </string-name>
          :
          <article-title>Open Source Search: A Data Mining Platform</article-title>
          .
          <source>Invited talk given at the Fourth IEEE International Conference on Data Mining</source>
          , Brighton, UK. (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. Thomas Kamps: ConWeaver. See http://www.conweaver.de (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Arjohn</given-names>
            <surname>Kampman</surname>
          </string-name>
          , Frank van Harmelen, Jeen Broekstra:
          <article-title>Sesame: A generic architecture for storing and querying rdf and rdf schema</article-title>
          .
          <source>In proceedings of ISWC 2002</source>
          , October 7-10, Sardinia, Italy (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Otis</given-names>
            <surname>Gospodnetic</surname>
          </string-name>
          , Erik Hatcher:
          <article-title>Lucene in action</article-title>
          .
          <source>Manning Publications</source>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>I.</given-names>
            <surname>Celino</surname>
          </string-name>
          , E. Della Valle:
          <article-title>Multiple vehicles for a semantic navigation across hyperenvironments</article-title>
          .
          <source>In proceedings of the Second European Semantic Web Conference, ESWC2005</source>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>