Semantic Data Clouding over the Webs
                                    - Ph.D. Thesis? Abstract -


                                            Gaia Varese

                                 Università degli Studi di Milano
                              Via Comelico, 39 - 20135 Milano, Italy
                               gaia.varese@dico.unimi.it


         Abstract. Very often, for business or personal needs, users require to retrieve, in
         a very fast way, all the available relevant information about a focused target en-
         tity, in order to take decisions, organize business work, plan future actions. To an-
         swer this kind of “entity”-driven user needs, a huge multiplicity of web resources
         is actually available, coming from the Social Web and related user-centered ser-
         vices (e.g., news publishing, social networks, microblogging systems), from the
         Semantic Web and related ontologies and knowledge repositories, and from the
         conventional Web of Documents. The Ph.D. thesis is devoted to define the no-
         tion of i-cloud and a semantic clouding approach for the construction of i-clouds
         that works over the Social Web, the Semantic Web, and the Web of Documents.
         i-clouds are built for a target entity of interest to organize all relevant web re-
         sources, modeled as web data items, into a graph, on the basis of their level of
         prominence and reciprocal closeness.

         Keywords: Semantic clouding, i-clouds, Social Web, Semantic Web


1     The research question of the thesis

The user expectations on the quality of results of web information searches are becom-
ing more and more high. Very often, for business or personal needs, users require to
retrieve, in a very fast way, all the available relevant information about a focused target
entity, in order to take decisions, organize business work, plan future actions. A target
entity is a keyword-based representation of a topic of interest, namely a real-world ob-
ject/person, an event, a situation, or any similar subject that can be of interest for the
user. To answer this kind of “entity”-driven user needs, a huge multiplicity of web re-
sources is actually available, coming from the Social Web and related user-centered ser-
vices (e.g., news publishing, social networks, microblogging systems), from the Seman-
tic Web and related ontologies and knowledge repositories, and from the conventional
Web of Documents. Each kind of web resource is differently structured according to a
variety of formats, ranging from short, unstructured, and ready-to-consume news/posts,
to well-structured, formal ontology, and each one can provide unique information for a
given target entity. For example, only web resources coming from the Social Web are
able to provide subjective information reflecting users opinions or preferences about
?
    Ph.D. thesis supervisor: Prof. Silvana Castano - Università degli Studi di Milano.
the target entity, which complement in a useful way the more objective information
provided by web resources coming from the other webs. To satisfy user expectations,
a new generation of web information search techniques has to cope with different re-
quirements: i) the capability to span across multiple webs, to properly consider the wide
variety of available web resources and pieces of knowledge by properly assessing their
information contribution nature; ii) the capability to anticipate the user needs by pro-
viding a focused but comprehensive set of web resources relevant for the target entity;
iii) the capability to semantically organize all retrieved web resources into an intuitive
and coherent structure for the given target entity.
     With respect to this scenario, the Ph.D. thesis is devoted to define the notion of
i-cloud and a semantic clouding approach for the construction of i-clouds that works
over the Social Web, the Semantic Web, and the Web of Documents. i-clouds are built
for a target entity of interest to organize all relevant web resources, modeled as web
data items, into a graph, on the basis of their level of prominence and reciprocal close-
ness. Prominence captures the importance of a web resource within the i-cloud, by
distinguishing, also in a visual way “a la tag-cloud”, how much relevant web resources
are with respect to the target entity. The level of closeness between web resources is
evaluated using matching and clustering techniques, with the goal of determining how
similar web resources are to each other and with respect to the target entity.
     The research methodology followed for the Ph.D. activity is based on the following
main phases: i) literature review with the aim of providing a critical comparison of the
state of the art solutions for semantic data clouding, ii) conceptual design where re-
quirements and foundational aspects related to the Ph.D. issues are formally addressed,
iii) prototype implementation where a prototype tool is developed according to the de-
fined architecture, and iv) evaluation of the proposed techniques on a number of real
test cases.


2    Related work
Relevant research work with respect to the Ph.D. thesis regards Linked Data, instance
matching, and data clouds.

Linked Data. A new generation of web applications for the integration of both data and
services is being emerging in the context of the Linked Data project [2]. Linked Data is
mainly focused on the idea of improving interoperability and aggregation among large
data collections already available on the web, such as for example DBLP 1 , DBPedia 2 ,
CiteSeer 3 , IMDB 4 , and Freebase 5 , which are available as retrievable RDF datasets or
SPARQL query endpoints. Linked Data is a step beyond the simple availability of data
and syntactic compatibility, in that it promotes some important principles in making
web data available and sharable to the Semantic Web community. Such principles are
 1
   http://www.informatik.uni-trier.de/~ley/db
 2
   http://dbpedia.org
 3
   http://citeseerx.ist.psu.edu
 4
   http://www.imdb.com
 5
   http://www.freebase.com
the following: i) all the web resources have to be referenced by a URI; ii) URIs have to
be resolvable on the web to RDF descriptions; iii) RDF triples have to be consumed by
a new generation of Semantic Web browsers and crawlers [15]. However, Linked Data
does not take into account the web resources originated from user-generated contents
like comments, posts and personal feeds, that are characterized by poor structure and
rapid obsolescence. Moreover, Linked Data builds a flat graph structure of intercon-
nected URIs, without distinguishing the prominence and closeness of web resources.

Instance matching. The same real-world object can be described multiple times in
different knowledge repositories, possibly using different perspectives and by empha-
sizing different properties of interest. The capability of finding similar object descrip-
tions assumes particular relevance in the field of Semantic Web, to promote effective
web resource sharing on the global scale and to correctly interoperate/reuse individual
knowledge chunks coming from disparate information repositories, disregarding their
specific URIs. Such task is called instance matching, and consists in finding instances
(i.e., object descriptions), coming from different sources, which describe the same real-
world object in a different and heterogeneous way. Some contributions in this direction
have been focused on defining techniques and approaches for the generation and man-
agement of identifiers at object-level, like, for example, the OKKAM project [3]. Other
approaches have been proposed for the unification of different URIs associated to the
same object [13]. Moreover, a problem related to instance matching is the one of find-
ing object descriptions referring to similar objects. To this end, suitable matching tech-
niques are required. Such techniques are mainly provided by the research work in the
field of record linkage, which has been widely studied in the databases community [8].
More recently, some new techniques have been proposed to specifically match ontology
instances [9] and to identify similar web resources [11]. However, none of the proposed
approaches is able to compare different kinds of object.

Data clouds. In the recent years, the traditional World Wide Web based on “user-
consuming” applications and informative web pages has changed into a more complex
vision composed of a plurality of webs, where semantic-intensive applications as well
as interactive “user-generated” platforms like microblogging, and news feeds are be-
coming more and more popular. In this scenario, the research efforts towards the de-
velopment of solutions for organizing this huge amount of web resources according to
semantic clouding or similar approaches is still at an initial stage [10]. Some interest-
ing work has been done in the field of news aggregation, with the aim of providing
techniques for their semantic organization and classification. Examples of proposed
systems are NewsInEssence [14] and Relevant News [1], which automatically group
news related to the same topic by exploiting hierarchical clustering algorithms and
tag/keyword-based search functionalities. For what concerns microdata sources, like
Twitter or Facebook, tools for semantic aggregation are still missing. In the same di-
rection, structured and collaborative search engines are being emerging as a promising
solution for presenting the query results in a sort of structured form. Examples in this
field are Wolfram Alpha 6 and Google Wonder Wheel 7 . In particular, Wolfram Alpha
 6
     http://www.wolframalpha.com
 7
     http://www.googlewonderwheel.com
is a computational knowledge engine based on data extraction from popular knowledge
repositories, like Wikipedia. The goal of this engine is to provide answers to the user
requests by returning a comprehensive picture of the available data retrieved about the
given request. The same idea is enforced by Google Wonder Wheel, which provides
also a graphical, cloud-oriented view of the query results based on terminological simi-
larities among different web resources. However, all these proposed solutions still lack
the integration between Social and Semantic Web resources, and provide a poor support
of semantic matching techniques for identifying similar web resources.


2.1    Contributions of the thesis

With respect to the state of the art, the contributions of the Ph.D. thesis are mainly the
following.

    – Definition of a cross-web approach considering the different kinds of available web
      resources (e.g., tagged resources, microdata resources, Semantic Web resources),
      and considering both objective and subjective information. As far as we know, our
      semantic clouding approach represents a first attempt to bridge the gap between
      Semantic Web resources (typically managed in Linked Data) and other kinds of
      web resource, such as, for example, tagged and microdata resources.
    – Definition of i-cloud as a new data structure for organizing relevant web resources
      for a given target entity on the basis of their prominence and closeness.
    – Definition of matching techniques for comparing different kinds of web resources.

In particular, in Table 1, the differences between Linked Data and i-clouds are summa-
rized.

                                 Linked Data                                                     i-cloud
                            Resulting structure: graph                                  Resulting structure: graph
           Aim: connect different RDF descriptions of the same object   Aim: organize the relevant web resources for a target entity
                                 Off-line process                                             On-line process
              One general graph (connecting different repositories)                  One graph for each target entity
                                  Directed graph                                             Undirected graph
                                Unweighted graph                                              Weighted graph
                        The nodes can be URIs or literals                         The nodes are web data items (wdis)
          The edges can be labeled with properties or with owl:sameAs The edges are labeled with the value of closeness between wdis
                   Connected data are described using RDF                  Connected data are described using the WDI model
                        No distinction between the nodes                          Each node has a different prominence
           Only descriptions referred to the same object are connected   Similar wdis are connected by different closeness values
          Data which are not described using RDF cannot be included             Each kind of web resource can be included


                           Table 1. Comparison between Linked Data and i-cloud


3     The proposed semantic clouding approach

In Figure 1, we show the semantic clouding approach developed for i-cloud construc-
tion. The approach is articulated in three phases: i) modeling of web resources, ii) clas-
sification of web resources, and iii) clouding of web resources.
                                                                                                 Prominence
                                     Clouding of web resources                                    evaluation
      Target entity                                                                              techniques


                                                                                                  Matching
                                   Classification of web resources
                                                                                                 techniques


                                       t
                                       t
                                       t
                                       t                      WDI repository
                                       inverted
                                         index


                                                                                                    WDI
                                     Modeling of web resources
                                                                                                   model


                  bookmarking/                    microblogging/                  RDF(S)/OWL
                   annotation                     social networks                 repositories
                    systems
                                                                                                 Source webs

                                      twitter facebook news feed

                       Tagged                     Microdata                    Semantic Web
                      resources                   resources                      resources


                                  Fig. 1. The semantic clouding approach


Modeling of web resources: WDI model. Building i-clouds by mixing up both ob-
jective and subjective information about a certain target entity requires the capability to
deal with a variety of web resources coming from different webs. For semantic cloud-
ing, all the different web resources are acquired from their respective source web and
they are stored in a support repository, called WDI repository, according to a reference
data model, called WDI model [6], based on the notion of web data item to represent
the metadata featuring the various kinds of web resource.


Classification of web resources. Our semantic clouding approach is based on the
capability of grouping the web data items on the basis of their closeness. The closeness
between two web data items wdii and wdij captures the level of similarity/semantic
relation holding between them and it is represented by a closeness coefficient cc(wdii ,
wdij ) ∈ [0, 1], calculated by comparing wdii and wdij . Such closeness coefficient
cc(wdii , wdij ) is calculated for each possible pair of web data items stored in the WDI
repository using appropriate matching techniques, and the corresponding values are
then used by a hierarchical clustering procedure in order to produce a closeness tree
where each leaf corresponds to a web data item, and inner nodes denote the closeness
coefficient values. To choose the matching techniques to use, we take into account the
nature and the different complexity that can characterize the different web resources,
and consequently, the corresponding web data items. In [4], we address the problem of
matching Semantic Web resources; in [6], we analyze the problem of classifying and
comparing microdata; in [7], we provide specific methods and techniques for organizing
and matching tags extracted from the Social Web. Moreover, in [5, 12], we present a
system for integrating Social and Semantic knowledge in a P2P environment.

Clouding of web resources. The clouding phase is based on the results of the classifi-
cation activity and aims at constructing the appropriate i-cloud organization for a given
target entity by prominence and closeness levels. An i-cloud is formally defined as an
undirected weighted graph IC e = (N, E) associated with a target entity e. A node
ni ∈ N represents a web data item wdii relevant for e, while an edge (ni , nj ) ∈ E
between two nodes ni and nj represents the level of closeness between wdii and wdij .
IC e is equipped with a labeling function ρ : N → [0, 1], that associates each node
ni ∈ N with a value p(ni ) ∈ [0, 1], and a labeling function σ : E → [0, 1], that
associates each edge (ni , nj ) ∈ E with a value c(ni , nj ) ∈ [0, 1]. A value p(ni )
denotes the level of prominence of the web data item wdii in IC e . A high value of
p(ni ) denotes that the web resource corresponding to wdii is very relevant for e. Dif-
ferent techniques are possible for the evaluation of the prominence in an i-cloud and
these techniques can be used alone or in combination. We devise three main categories
of techniques for prominence evaluation, namely provenance-base, target-based, and
popularity-based techniques. A value c(ni , nj ) denotes the level of closeness between
the web data items wdii and wdij in IC e . In particular, c(ni , nj ) is equal to the close-
ness coefficient cc(wdii , wdij ) calculated in the previous phase.

    An example of i-cloud is shown in Figure 2, collecting web resources related to the
target entity “Star Wars”. We can observe that web resources in the i-cloud are not only
those directly related to this popular movie, such as the titles of the six movies of the
Star Wars saga, but also resources that are close to the movie saga even if not directly
matching the target, such as some of the most important characters in the movies. The
dimension of each node in the i-cloud is proportional to the prominence of the corre-
sponding web resource for “Star Wars” and the edges connecting the nodes are labeled
with their closeness degree.


4    Ongoing and future work
We have presented the thesis work we are undergoing for semantic data clouding. On-
going and future work will be devoted to formally define the properties of i-clouds and
the operations that can be applied between different i-clouds (e.g., selection, projection,
join). Furthermore, some preliminary evaluation of our semantic clouding approach
has been performed using data extracted from Delicious 8 , Twitter 9 , and Freebase 10 .
i-clouds are evaluated on the basis of their level of accuracy and by analyzing the depen-
dency between their size (i.e., the number of web data items) and their cohesion (i.e.,
the average level of closeness between web data items). The accuracy of an i-cloud is
defined as its capability to collect web resources which are really relevant with respect
to the given target entity, and it depends on the matching techniques that are used for
 8
   http://www.delicious.com
 9
   http://search.twitter.com
10
   http://www.freebase.com
               wdi(twitter1)
               [luke skywalker]
                                                                           Mark Hamill, Luke Skywalker himself, to appear
                                                                           at Star Wars Clebration V in Orlando
                                                                          16 Jul 2010 06:45:08
                                                                                                                                                   Star Wars
                            0.86            0.7

          wdi(twitter2)
       [princess leila organa]

                                   0.7                                                  wdi(freebase5)                       wdi(freebase4)
                                                                    0.7                      [darth vader]
                                                                                                                                                             wdi(freebase3)
                                         wdi(freebase6)                                                                     [star wars episode i
                                                                                                                                                          [star wars episode ii
                                          [obi-wan kenobi]                                              0.5
                                                                                                                           the phantom menace]
                                                                                                                                                          attack of the clones]
                                                                                                                                           0.788

                                                                                             0.5
                                     wdi(delicious2)
                                   [star wars episode vi                                                        0.6
                                                                                                                                   0.788
                                     return of the jedi]                                                                                             0.813
                                                                            0.64

                                                            0.828                                   wdi(freebase1)                                   wdi(freebase2)
                                                                                                   [star wars episode iv                            [star wars episode iii
                                    wdi(delicious1)                             0.64                    a new hope]                                  revenge of the sith]
                                   [star wars episode v
                                 the empire strikes back]


                                  Fig. 2. Example of i-cloud for the target entity “Star Wars”


clustering web data items. In order to evaluate the quality of our matching techniques,
we exploited the IIMB 2010 dataset 11 and related tools, that are used also for the
international instance matching evaluation contest of the Ontology Alignment Evalua-
tion Initiative (OAEI) 12 . The obtained results show that the accuracy of our matching
tool HMatch 2.0 is significantly higher than the one of a simple string matching algo-
rithm. The effective applicability of the semantic clouding approach in real application
contexts and how it is affected by the number of web data items stored in the WDI
repository is also under study.


References

 1. Bergamaschi, S., Guerra, F., Orsini, M., Sartori, C., Vincini, M.: RELEVANT News: a Se-
    mantic News Feed Aggregator. In: Proc. of the 4th Workshop on Semantic Web Applications
    and Perspectives (SWAP 2007). Bari, Italy (2007)
 2. Bizer, C., Heath, T., Berners-Lee, T.: Linked Data - The Story So Far. International Journal
    on Semantic Web and Information Systems 5(3) (2009)
 3. Bouquet, P., Stoermer, H., Mancioppi, M., Giacomuzzi, D.: OkkaM: Towards a Solution
    to the ”Identity Crisis” on the Semantic Web. In: Proc. of the 3rd Italian Semantic Web
    Workshop. Pisa, Italy (2006)
 4. Castano, S., Ferrara, A., Montanelli, S., Varese, G.: Matching Semantic Web Resources. In:
    Proc. of the 8th Int. Workshop on Web Semantics (WebS 2009) co-located with the 20th
11
     http://www.instancematching.org/oaei/imei2010.html
12
     http://oaei.ontologymatching.org/2010
    Int. Conference on Database and Expert Systems Applications (DEXA 2009). Linz, Austria
    (2009)
 5. Castano, S., Ferrara, A., Montanelli, S., Varese, G.: Semantic Coordination of P2P Collective
    Intelligence. In: Proc. of the Int. ACM Conference on Management of Emergent Digital
    EcoSystems (MEDES 2009). Lyon, France (2009)
 6. Castano, S., Ferrara, A., Montanelli, S., Varese, G.: Similarity-based Classification of Micro-
    data. In: D’Atri, A., Ferrara, M., George, J., Spagnoletti, P. (eds.) Information Technology
    and Innovation Trends in Organizations. Springer-Verlag (2010)
 7. Castano, S., Varese, G.: Building Collective Intelligence through Folksonomy Coordination.
    In: Bessis, N., Xhafa, F. (eds.) Next Generation Data Technologies for Collective Computa-
    tional Intelligence. Springer-Verlag (2011), to appear
 8. Hernández, M., Stolfo, S.: The Merge/Purge Problem for Large Databases. In: Proc. of the
    ACM SIGMOD Int. Conference on Management of Data. San Jose, California, USA (1995)
 9. Isaac, A., Van der Meij, L., Schlobach, S., Wang, S.: An Empirical Study of Instance-Based
    Ontology Matching. In: Proc. of the 6th Int. Semantic Web Conference (ISWC 2007). Busan,
    Korea (2007)
10. Koutrika, G., Bercovitz, B., Ikeda, R., Kaliszan, F., Liou, H., Zadeh, Z., Garcia-Molina, H.:
    Social Systems: Can We Do More Than Just Poke Friends? In: Proc. of the 4th Biennial
    Conference on Innovative Data Systems Research (CIDR 2009). Asilomar, CA, USA (2009)
11. Langegger, A., Wöß, W., Blöchl, M.: A Semantic Web Middleware for Virtual Data Integra-
    tion on the Web. In: Proc. of the 5th European Semantic Web Conference (ESWC 2008).
    Tenerife, Spain (2008)
12. Montanelli, S., Castano, S., Ferrara, A., Varese, G.: Managing Collective Intelligence in
    Semantic Communities of Interest. International Journal of Organizational and Collective
    Intelligence (IJOCI), Special Issue on Collectively Intelligent Information and Knowledge
    Management 1(4) (2010)
13. Nikolov, A., Uren, V., Motta, E., Roeck, A.D.: Handling Instance Coreferencing in the Kno-
    Fuss Architecture. In: Proc. of the 1st ESWC Int. Workshop on Identity and Reference on
    the Semantic Web (IRSW 2008). Tenerife, Spain (2008)
14. Radev, D., Otterbacher, J., Winkel, A., Blair-Goldensohn, S.: NewsInEssence: Summarizing
    Online News Topics. Communications of the ACM 48(10) (2005)
15. Tummarello, G., Delbru, R., Oren, E.: Sindice.com: Weaving the Open Linked Data. In:
    Proc. of the 6th Int. Semantic Web Conference (ISWC 2007). Busan, South Korea (2007)