=Paper= {{Paper |id=None |storemode=property |title=Thematic Exploration of Linked Data |pdfUrl=https://ceur-ws.org/Vol-880/VLDS-p11-Castano.pdf |volume=Vol-880 |dblpUrl=https://dblp.org/rec/conf/vlds/CastanoFM11 }} ==Thematic Exploration of Linked Data== https://ceur-ws.org/Vol-880/VLDS-p11-Castano.pdf
                                Thematic Exploration of Linked Data

                     Silvana Castano                                      Alfio Ferrara                     Stefano Montanelli
            Università degli Studi di Milano                 Università degli Studi di Milano        Università degli Studi di Milano
              DICo - Via Comelico, 39 -                         DICo - Via Comelico, 39 -                DICo - Via Comelico, 39 -
                 20135 Milano, Italy                               20135 Milano, Italy                      20135 Milano, Italy
             silvana.castano@unimi.it                              alfio.ferrara@unimi.it              stefano.montanelli@unimi.it


ABSTRACT
Now that a huge amount of data is available in the Linked Data Cloud, providing techniques
for its effective exploration is becoming more and more important. In this paper, we propose
aggregation and abstraction techniques for thematic exploration of linked data. These
techniques transform a basic, flat view of a potentially large set of messy linked data for a
given search target into a high-level, thematic view called inCloud. In an inCloud, thematic
exploration is guided by a few essentials auto-describing their prominence for the search
target and by their reciprocal proximity relations.

Categories and Subject Descriptors
H.5 [Information Systems]: Information Interfaces and Presentation; H.3 [Information
Systems]: Information Storage and Retrieval

General Terms
Linked data aggregation, labeling, and exploration

Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed for
profit or commercial advantage and that copies bear this notice and the full citation on the
first page. To copy otherwise, to republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee. This article was presented at:
Very Large Data Search (VLDS) 2011.
Copyright 2011.

1.     INTRODUCTION
   The Linked Data paradigm promoted a new way of exposing, sharing, and connecting pieces of
data, information, and knowledge on the Semantic Web, based on URIs (Uniform Resource
Identifiers) and RDF (Resource Description Framework) [1]. Now that a huge amount of data is
available in the Linked Data Cloud, providing techniques for effective linked data searching,
exploration, and visualization is becoming crucial [7, 9]. In the recent literature, issues
related to linked data exploration are gaining increasing importance [8, 12]. One of the most
challenging questions is to provide effective browsing solutions capable of dealing with the
inherent flat organization of linked data and of managing the existing huge-sized repositories
storing millions of RDF triples.
   In this context, we propose abstraction and aggregation techniques to transform a basic,
flat view of a potentially large set of messy linked data into an inCloud, that is, a
high-level, thematic view enabling a more effective, theme-driven exploration of the same
dataset. Through aggregation techniques, we identify clusters of semantically related linked
data in a (possibly large) collection representing the response to a search target. Through
abstraction techniques, we mine suitable essentials capturing the theme dealt with by a linked
data cluster and its relevance for the search target, as well as proximity relations
reflecting the reciprocal degree of closeness between cluster essentials. We motivate the role
of inClouds through a real example of a linked data collection extracted from the Freebase
repository considering Van Gogh as search target. Moreover, we describe the construction of an
inCloud through aggregation and abstraction techniques. Finally, we show how the inCloud
representation can be used for thematic browsing and exploration of the underlying linked data
collection.

2.     MOTIVATING EXAMPLE
   In a common scenario, the user interested in exploring a linked data repository to satisfy
a certain search target usually has to face a long and hardly intuitive browsing activity.
This is due to the inherent flat organization of linked data repositories, where the URIs of
interest for a given target frequently require the user to follow more than one property link
before being explored. In particular, the user exploration is typically characterized by the
following steps:

     • Submission to the repository of a search target (t), namely a keyword (or a list of
       keywords) that describes the subject of interest for the search. An example of search
       target is the name of the famous painter Vincent van Gogh.

     • Selection of the seed of interest (s), namely a URI that represents the "point of
       origin" for the exploration about the search target. The seed of interest is chosen
       from the list of URIs returned by the repository as a reply to the search target. In
       the Freebase linked data repository (http://www.freebase.com/), an example of seed for
       our target is the URI /en/vincent_van_gogh.

     • Exploration of the URIs reachable from the seed with the aim of accessing more
       information about the search target. This requires the user to submit appropriate
       queries to the repository to extract the seed
                   properties and the URIs directly linked to s through                                                                                                                                                                                                                                     van_gogh is shown in Figure 2. In an inCloud
                   these properties. An example of MQL query for the
                   Freebase repository to extract the artworks directly con-                                                                                                                                                                                                                                     • a circle-box represents a a cluster, namely a group of
                   nected to the seed s = /en/vincent_van_gogh through                                                                                                                                                                                                                                             linked data focused on a specific argument/topic re-
                   the property /visual_art/visual_artist/artworks                                                                                                                                                                                                                                                 lated to the considered seed (e.g., the set of artists
                   is [{ “id”: “/en/vincent van gogh”, “type”: “/visual art/vi-                                                                                                                                                                                                                                    that influenced or have been influenced by van Gogh
                   sual artist”, “/visual art/visual artist/artworks”: {}}].                                                                                                                                                                                                                                       (cluster Cl3 ) or the set of van Gogh artworks about
                                                                                                                                                                                                                                                                                                                   sunflowers (Cluster Cl4 );
  The exploration step can be recursively applied to the
visited URIs to progressively discover further URIs at higher                                                                                                                                                                                                                                                    • a square-box represents an essential, namely a concise
distance from the seed according to the user choices and                                                                                                                                                                                                                                                           and convenient summary of the content of a cluster at a
interests.                                                                                                                                                                                                                                                                                                         glance (e.g., Topic Artwork, Sunflower used to summarize
  Due to the huge number of linked data that is usually con-                                                                                                                                                                                                                                                       the content of Cluster Cl4 ). Clusters in an inCloud
cerned with a search target, a lot of exploration steps are                                                                                                                                                                                                                                                        are also characterized by a prominence value denoting
required to build a (more or less) comprehensive picture of                                                                                                                                                                                                                                                        its level of importance for the target in the framework
the available information about the target. As an example,                                                                                                                                                                                                                                                         of the overall inCloud. Prominence values determine
in Figure 1, we show the set of linked data extracted from                                                                                                                                                                                                                                                         the size of the cluster circles, thus the most prominent
Freebase for the target Vincent van Gogh. In this example, we                                                                                                                                                                                                                                                      cluster in the inCloud of Figure 2 is Cluster Cl1 ;

                                                                                                                                         /fictional_universe/ethnicity_in_fiction                                                                                                                                • an arrow represents a proximity relation between clus-
                                                                                                                                                                        /people/ethnicity




                                                                                    /exhibitions/exhibition
                                                                                                                                 /media_common/lost_work
                                                                                                                                                                                                                                                                                                                   ters/essentials, namely a closeness relationship between
           /common/webpage
                                                                                                         /en/farmhouse_in_provence
                                                                                                                                                                                                                                                                                                                   the themes/topics their represent. The arrow thick-
                                                                                   /en/correspondences_vincent_van_gogh_john_chamberlain
                                                                             /en/the_works_of_vincent_van_gogh
                                                                                                                                                                                                                                                                                                                   ness denotes the degree of proximity between the two
                                                      /en/copies_after_millet_and_others                                                        /en/dutch_people                                          /freebase/equivalent_topic


               /tag/ukguardian/$002Fartanddesign$002Fvan-gogh
                                                              /en/wheat_field_with_cypresses
                                                           /en/letters_1886_90
                                                             /en/vincent_by_himself
                                                                                                   /en/self_portrait_with_bandaged_ear_and_pipe
                                                                                                        /en/self_portrait_dedicated_to_paul_gauguin
                                                                                  /en/peasant_woman_against_a_background_of_wheat
                                                                                                           /en/self_portrait_with_dark_felt_hat_at_the_easel
                                                                                  /en/van_goghs_letters_the_artist_speaks
                                                                                                                                            /en/falling_autumn_leaves
                                                                                                                        /quotationsbook/quote/25663
                                                                                                                                                                                                                           /music/performance_role
                                                                                                                                                                                                                                                                                                                   clusters/essentials connected by the arrow.
                                                                    /en/self_portrait_with_a_straw_hat       /en/irises
                                /en/a_corner_of_montmartre_the_moulin_a_poivre
                                                                         /en/the_real_van_gogh_the_artist_and_his_letters                                     /en/the_night_cafe
                                                                                                                      /en/vase_with_five_sunflowers
                                                                                                         /en/still_life_majolica_with_wildflowers
                                                                                          /en/the_church_at_auvers                                                                                          /business/job_title
                                                                 /en/van_gogh_stained_glass_coloring_book
                                                                           /quotationsbook/quote/25726
                                                                                                  /en/portrait_of_eugene_boch /en/self_portrait_with_pipe_and_glass
                                                                   /book/written_work/en/daubignys_garden


                                                                                                                                                                                                                                                                                                               Aggregation techniques are first employed to enforce a “the-
                                                                       /en/les_arenes                                                                                                                      /fictional_universe/character_occupation
                                                        /en/la_mousme                                 /en/wheat_field_with_crows  /en/spring_in_arles
                                                                                                  /en/self_portrait_with_dark_felt_hat
                          /exhibitions/exhibit                                                                                      /en/self_portrait_in_front_of_the_easel
                                                                                                                   /quotationsbook/quote/3154
                                                                                     /quotationsbook/quote/3153                                       /en/vase_with_three_sunflowers
                                                                                                                        /en/self_portrait_with_pipe_and_straw_hat
                                                                                                                                                        /en/the_olive_trees
                                                                                                 /en/corridor_in_the_asylum  /quotationsbook/quote/24567   /en/the_painter_of_sunflowers
                                                             /en/portrait_of_joseph_roulin
                                                            /en/catalogue_of_the_third_vincent_van_gogh_exhibition              /en/i_am_not_an_adventurer_by_choice_but_by_fate                                     /projects/project_role
                                                                                                                    /en/starry_night_over_the_rhone
                                                                                                                     /en/the_wheat_field /en/color_your_own_van_gogh_paintings                                                       / f il m / f il m _ jo b


                                                                                                                                                                                                                                                                                                            matic” clustering of the initial set of linked data, as de-
                                                     /en/flowering_orchards         /en/boats_on_the_beach
                                                                          /en/les_alyscamps                 /en/cartas_a_theo_cyc                                                   / en / ar t i s t
                                                                                           /en/letters_from_provence/en/entrance_to_the_public_gardens_in_arles
                                                                                   /en/the_van_gogh_album
                                                             /quotationsbook/quote/36550                                       /en/the_drinkers/en/breton_women_and_children
                                   /en/four_cut_sunflowers                                                          /en/conscience_is_a_mans_compass            /en/cafe_terrace_at_night
       /interests/collectable_item            /en/self_portrait_with_straw_hat
                                                   /en/self_portrait_with_bandaged_ear/en/van_gogh_on_art_and_artists                                             /en/double-squares_and_squares
                                                                                                                         /en/pork_butchers_shop_in_arles
                                                                                             /media_common/quotation
                                                                                                               /en/at_eternitys_gate
                                                                                                             /visual_art/artwork                                                                                           /people/profession
                                                       /en/correspondance_generale_tome_2                                                                                                    /en/painter
                                             /en/thatched_cottages_by_a_hill
                                                              /en/lettres_a_son_frere_theo
                                             /en/bedroom_in_arles
                                                                                             /book/book
                                                                    /en/portrait_of_madame_augustine_roulin
                                                                                                             /en/the_starry_night
                                                                                                 /visual_art/art_series

                                                                                         /en/japonaiserie_flowering_plum_tree_after_hiroshige
                                                                              /en/subhash_awchat
                                           /en/a_good_picture_is_equivalent_to_a_good_deed                    /common/topic
                                                                                                                                                  /en/the_old_cemetery_tower_at_nuenen                                            /location/nl_municipality
                                                                                                                                                                                                                                                                                                            scribed in Section 3. Abstraction techniques are then ap-
                                             /en/the_paintings_of_van_gogh                            /en/vincent_van_gogh
                                                                /en/portrait_of_vincent_van_gogh
                                           /en/a_peasant_woman_digging_in_front_of_her_cottage
                                    /en/vincent_van_gogh_famous_lives
                                                                   /en/flower_paintings_giftwrap_paper
                                                    /en/schoolboy_camille_roulin
                                          /en/the_town_hall_at_auvers
                                                                        /en/ivy_two_paintings_by_vincent_van_gogh
                                                                                                          /en/willem_roelofs
                                                                                                                                              /en/view_of_arles_flowering_orchards
                                                                                                                                   /en/vase_with_fifteen_sunflowers
                                                                                                                                    /en/portrait_of_paul_eugene_milliet
                                                                                                                   /en/nude_woman_on_a_bed
                                                                                                                                                                        /en/joan_glass
                                                                                                                                                                         /en/sunflowers
                                                                                                                                                   /quotationsbook/quote/35294
                                                                                                                  /quotationsbook/quote/201 /en/portrait_of_adeline_ravoux
                                                                                                                                     /quotationsbook/quote/35821
                                                                                                                                                                                       /en/zundert
                                                                                                                                                                                                                                 /business/employer

                                                                                                                                                                                                                /location/citytown
                                                                                                                                                                                                                                                           /organization/organization_scope
                                                                                                                                                                                                                                                                                                            plied to synthesize an inCloud over the thematic clusters, as
                                                                                                                                                                                                                                                                                                            described in Section 4.
[Figure 1 omitted: a flat graph whose nodes are Freebase URIs (e.g., /en/claude_monet, /en/the_potato_eaters) and types (e.g., /visual_art/visual_artist, /people/deceased_person) linked to the seed.]

Figure 1: A graph of linked data extracted from the Freebase repository about the search target Vincent van Gogh

considered the seed s = /en/vincent_van_gogh, we explored the complete set of directly linked URIs and some selected URIs at distance d = 2 from s. As this simple example makes clear, exploring such a flat and huge collection of data is cumbersome. First, the representation is flat, so it is impossible to immediately understand whether some URIs are more important than others. Moreover, possible sets of URIs addressing the same or a similar topic about the target are neither highlighted nor grouped.

   The solution we propose is based on aggregation and abstraction techniques to transform a basic, flat view of linked data like the one in Figure 1 into an inCloud providing a high-level, thematic view of the same data. inClouds are conceived to be coupled with the conventional query interfaces of the existing Linked Data repositories, in that they can be built on top of an extracted dataset to provide a more effective presentation of the result.
   An example of inCloud for the seed s = /en/vincent_

3.    LINKED DATA AGGREGATION
   The goal of aggregation techniques is to transform an initial set of linked data into a number of thematic clusters. The starting point is an RDF graph G_s containing the linked data about a certain seed s of interest, automatically extracted from a Linked Data repository R. Appropriate extraction queries are defined to this end according to the language (e.g., SPARQL, MQL) supported by the repository R. These queries generally enforce the following extraction/filtering operations:

   • Extraction of properties and corresponding values within a distance ≤ d from the seed s. We consider that a URI in the repository R is concerned with the seed s if there is a property path of length ≤ d between the URI and s. The distance d can be dynamically changed, and it has an impact on the number of extracted linked data and thus on the size of the resulting RDF graph. In usual scenarios, a distance d = 2 is a good trade-off to obtain a sufficient number of linked data about s and a well-sized RDF graph.

   • Extraction of the URI types. For each URI within a distance ≤ d from the seed s, we extract the list of types (i.e., classes) the URI belongs to. The appropriate property of the repository R is exploited to this end (e.g., the property type in Freebase).

   • Filtering of non-relevant properties. Loosely meaningful properties of a repository, like the property image of Freebase, can be excluded from the resulting RDF graph since they provide little information about s.
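The three extraction/filtering operations listed above can be illustrated with a minimal sketch over an in-memory list of triples. This is not the extraction query used by the authors (which would be written in SPARQL or MQL against the repository); the toy triples, the short property names, the function name extract_graph, and the choice to follow property links in both directions are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical toy repository R: (subject, property, object) triples.
# Property names are shortened placeholders, not real Freebase properties.
TRIPLES = [
    ("/en/vincent_van_gogh", "influenced_by", "/en/anton_mauve"),
    ("/en/vincent_van_gogh", "artworks", "/en/the_potato_eaters"),
    ("/en/vincent_van_gogh", "image", "/img/1"),
    ("/en/the_potato_eaters", "art_form", "/en/painting"),
    ("/en/painting", "related", "/en/drawing"),
    ("/en/vincent_van_gogh", "type", "/visual_art/visual_artist"),
    ("/en/anton_mauve", "type", "/people/person"),
]


def extract_graph(triples, seed, d=2, type_property="type",
                  excluded=frozenset({"image"})):
    """Return (dist, edges, node_types) for URIs within distance <= d of seed."""
    adjacency = {}   # node -> neighbouring nodes (links followed both ways)
    types = {}       # node -> set of type values
    for s, p, o in triples:
        if p in excluded:            # filtering of non-relevant properties
            continue
        if p == type_property:       # extraction of the URI types
            types.setdefault(s, set()).add(o)
            continue
        adjacency.setdefault(s, []).append(o)
        adjacency.setdefault(o, []).append(s)

    # Breadth-first search bounded by the distance d from the seed.
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if dist[node] == d:
            continue
        for neighbour in adjacency.get(node, []):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)

    # Keep only property links whose endpoints were both reached.
    edges = [(s, p, o) for s, p, o in triples
             if p not in excluded and p != type_property
             and s in dist and o in dist]
    node_types = {n: sorted(types.get(n, ())) for n in dist}
    return dist, edges, node_types


dist, edges, node_types = extract_graph(TRIPLES, "/en/vincent_van_gogh", d=2)
```

In this toy run, raising d from 1 to 2 adds /en/painting to the result, which mirrors the remark that d controls the size of the extracted graph.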
[Figure 2 omitted: inCloud for Vincent Van Gogh (/en/vincent_van_gogh), showing thematic clusters Cl1 to Cl5 with a legend for essential, thematic cluster, and proximity link. Clusters carry labels such as "Topic Artwork, Portrait", "Topic Artwork, Sunflower", and "Influence Node & Person & Visual Artist & Deceased Person", and list their URIs with frequencies, e.g., /en/portrait_of_camille_roulin (7) and /en/vase_with_twelve_sunflowers (2).]

Figure 2: An example of inCloud extracted from the Freebase repository for the seed /en/vincent_van_gogh


   The query result is the graph G_s = (N_s, E_s), where a node n ∈ N_s, called linked data entity, can be a URI, a literal, or a type value that satisfies the query selection, and an edge e(n_i, n_j) ∈ E_s, called property link, represents a property relationship of R between the nodes n_i, n_j ∈ N_s.

   Based on the RDF graph G_s, linked data aggregation is articulated in two main steps, namely similarity evaluation and thematic clustering.

3.1     Similarity evaluation
   The goal of this step is to analyze the graph G_s and to generate an augmented linked data graph G_s^+ where a similarity link is added between each pair of matching linked data entities in N_s. To this end, the level of affinity between the entities of N_s is evaluated as follows. Given two linked data entities n_i, n_j ∈ N_s, the linked data affinity σ(n_i, n_j) ∈ [0, 1] denotes the level of similarity of n_i and n_j based on the commonalities of their terminological equipments. Each linked data entity n ∈ N_s is associated with a terminological equipment Term_n = {term_1, ..., term_m} where term_j, with 1 ≤ j ≤ m, is a term appearing in the label of a node adjacent to n in G_s, or a term appearing in the label of n itself. Before inclusion in a terminological equipment, each term is submitted to a normalization procedure for word-lemma extraction and for compound-term tokenization [4, 15].
   The affinity σ of two linked data entities n_i, n_j ∈ N_s is calculated as the Dice coefficient over their terminological equipments as follows:

\[ \sigma(n_i, n_j) = \frac{2 \cdot |\, term_x \sim term_y \,|}{|Term_{n_i}| + |Term_{n_j}|} \]

where term_x ∼ term_y denotes that term_x ∈ Term_{n_i} and term_y ∈ Term_{n_j} are matching terms according to a string matching metric that considers the structure of the terms term_x and term_y. For σ calculation, we employ our matching system HMatch 2.0, where state-of-the-art metrics for string matching (e.g., I-Sub, Q-Gram, Edit-Distance, and Jaro-Winkler) are implemented [2]. A similarity link e(n_i, n_j) is established between the linked data entities n_i and n_j iff σ(n_i, n_j) ≥ th, where th ∈ (0, 1] is a matching threshold denoting the minimum level of similarity required to consider two linked data entities as matching entities.

3.2            Thematic aggregation
   The goal of this step is to analyze the graph G_s^+ obtained through similarity evaluation and to identify/mine a set CL of thematic clusters. Given a graph G_s^+, a cluster Cl = {(n_1, f_1), ..., (n_h, f_h)} is a set of linked data entities n_1, ..., n_h ∈ N_s that are more similar to each other than to the other entities of N_s. Each entity n_j belonging to Cl is associated with a corresponding frequency f_j which denotes the number of occurrences of n_j in Cl.

   Clusters are determined by exploiting the graph G_s^+ and by detecting those node regions that are highly interconnected through property/similarity links. The problem of thematic aggregation is analogous to the problem of cluster calculation in graph theory, where clusters are also known as modules, communities, or cohesive groups. For this reason, for thematic aggregation, we rely on a clique percolation method (CPM) [13]. The CPM is based on the notion of k-clique, which corresponds to a complete (fully-connected) sub-graph of k nodes within the graph G_s^+. Two k-cliques are defined as adjacent k-cliques if they share k − 1 nodes. The CPM determines clusters from k-cliques. In particular, a cluster, or more precisely, a k-clique-cluster, is defined as the union of all k-cliques that can be reached from each other through a series of adjacent k-cliques. As a consequence, a typical k-clique-cluster is composed of several cliques (with size ≤ k) that tend to share many of their nodes. Since the cliques of a graph can share one or more nodes, we observe that a node can belong to several clusters, and thus clusters can overlap. In our approach, we employ the CPM implemented in the CFinder tool (see footnote 2). Although the determination of the full set of cliques of a graph is widely believed to be a non-polynomial problem, CFinder proves to be efficient when applied to graphs like those considered in our approach. Such

   In order to represent this fact, we introduce the notion of prominence of a cluster, namely a value P_i ∈ [0, 1]. The higher P_i is, the higher is also the prominence of Cl_i in the inCloud. In our approach, the level of prominence of a cluster is higher when the cluster is very focused on its theme and its contents are homogeneous. In particular, we formalize two cluster properties, namely variability and density.
   Variability v_i is the degree of overlap among the cliques of the cluster Cl_i. For a linked data entity n_j ∈ N_s^+, we
an algorithm is based on first locating all complete sub-         call fj the frequency of nj , that is the number of cliques of
graphs of Gs+ that are not part of larger complete subgraphs,     Cli that contain nj . Variability vi is measured by a coeffi-
and then on identifying existing k-clique-clusters by carry-      cient of variation, which is the ratio between the standard
ing out a standard component analysis of the clique-clique        deviation of the linked data entity frequencies in Cli and
overlap matrix [6]. As a result, CFinder produces the full set    the arithmetic mean of those frequencies, as follows (with f
CL of k-clique-clusters existing in the graph Gs+ for all the     denoting the arithmetic mean value of frequencies):
possible values of k. A linked data entity ni belonging to
a cluster Cl ∈ CL is represented as a pair (ni , fi ) where                            v
                                                                                       u        Ni
the frequency value fi denotes the number of cliques of Cl                            1u 1     X
                                                                                  vi = t           (fi − f )2
which the entity nj belongs to (see Example of Figure 2).                             f Ni − 1 i=1
The entities of a cluster are represented with different sizes,
proportional to the corresponding frequency values accord-           According to this definition, high values of vi denote a
ing to a visualization manner “à la tag-cloud”3 .                low degree of overlap in the cliques of the cluster Cli , while
                                                                  low values of vi denote a high degree of overlap in the Cli
                                                                  cliques.
4.    LINKED DATA ABSTRACTION                                        Density di of a cluster Cli is the degree of interconnection
  The goal of linked data abstraction techniques is to build      among the linked data entities of Cli . The density coeffi-
an inCloud, namely a high-level view on top of linked data        cient di = 2 · Ri /Ni (Ni − 1) is the ratio between the number
clusters by synthesizing them through essentials. inCloud         Ri of links in the cluster Cli and the maximum number of
clusters are also featured by a level of prominence and by        possible links. According to this definition, high values of di
proximity relations that denote the level of overlapping of       denote a high degree of interconnection among the cluster
the different clusters.                                           Cli entities, while low values of di denote a low degree of
                                                                  interconnection. The prominence Pi of a cluster Cli is cal-
4.1    Essential abstraction                                      culated on the basis of its variability and density as follows:
  An essential Essi is a concise and convenient summary of
a thematic cluster Cli and it is defined as a pair of the form                               2 · (1 − vi ) · di
Essi = (Ci , Di ) where Ci is the category associated with                            Pi =
                                                                                              (1 − vi ) + di
Cli and Di is a descriptor associated with Cli . A category
Ci is a set composed by the labels of the most frequent              According to this approach, most prominent clusters are
types of the linked data entities in Cli , while a descriptor     those which are more focused and homogeneous with respect
Di is a set composed by the most frequent terms in the            to their theme. We graphically represent cluster prominence
terminological equipments of the entities in Cli . If more        by drawing circles proportional to the prominence values of
than one most equally-frequent type and/or term exist, they       the corresponding clusters. In our example of Figure 2, clus-
are all inserted in Ci and Di , respectively. In the example      ters Cl1 and Cl4 are more prominent (larger circles) because
of Figure 2, the cluster Cl4 corresponds to a very focused        they are more focused and homogeneous. On the opposite,
theme expressed by the essential category Topic Artwork (the      clusters like Cl3 , which collect several entities of different
most frequent type of the entities in the cluster) and by         types are considered less prominent (smaller circle). How-
the essential descriptor Sunflower (the most frequent term in     ever, other options are possible for the evaluation of promi-
the terminological equipments of the entities in Cl4 ). In        nence in case of specific application needs. A first option
cases where many entities are equally frequent in a cluster,      is to consider a cluster to be more prominent as it is more
the abstracted essential is less focused and contains more        close to the seed s of interest. In this case, the prominence
terms. This is the case for example of the cluster Cl3 of         Pi of a cluster Cli is evaluated by taking into account the
Figure 2, representing persons and visual artists influenced      average value of similarity between the linked data entities
by Van Gogh. In this case, the most frequent terms used           in the cluster Cli and s, weighted by the frequency of each
as descriptors are the names of the people involved in the        entity ni in Cli , as follows:
cluster, which are all equally frequent in the cluster.
                                                                                             Ni
4.2    Prominence evaluation
                                                                                             P
                                                                                                   σ(ni , s) · fi
                                                                                             p=1
   Clusters (and related essentials) in an inCloud are dif-                           Pi =          Ni
ferently relevant with respect to the original search target.                                       P
                                                                                                          fi
                                                                                                    p=1
2
 Available at http://www.cfinder.org/.
3
 For a more readable visualization of highly-populated clus-      where fi denotes the frequency of the linked data entity ni
ters, the representation of less-frequent linked data entities    in the cluster Cli . Another option is to consider the promi-
can be omitted.                                                   nence Pi of a cluster Cli as proportional to the dimension
Ni of Cli and to the size ki of the smaller clique in Cli , as     tension to the multi-repository exploration and to the multi-
follows: Pi = 2 · Ni · ki /Ni + ki .                               seed extraction can be performed.
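As an illustration of the measures defined in Section 4.2, the following Python sketch computes the entity frequencies, the variability $v_i$, the density $d_i$, and the three prominence options for a toy cluster. It is only a sketch under stated assumptions: the function names, the representation of a cluster as a list of cliques (sets of entity identifiers), and the derivation of the cluster links from clique membership are choices made for this example, and they are not part of CFinder or of the prototype described in the paper.

```python
from itertools import combinations
from statistics import mean, stdev


def frequencies(cliques):
    """Count, for each entity, the cliques of the cluster that contain it."""
    freq = {}
    for clique in cliques:
        for entity in clique:
            freq[entity] = freq.get(entity, 0) + 1
    return freq


def links_from_cliques(cliques):
    """Assumption: the links of the cluster are the pairs sharing a clique."""
    return {frozenset(pair) for clique in cliques
            for pair in combinations(sorted(clique), 2)}


def variability(freq):
    """Coefficient of variation: sample standard deviation over mean."""
    values = list(freq.values())
    return stdev(values) / mean(values)  # needs at least two entities


def density(num_links, num_entities):
    """Links in the cluster over the maximum number of possible links."""
    return 2 * num_links / (num_entities * (num_entities - 1))


def prominence(v, d):
    """Default prominence: combines (1 - v) and d as in Section 4.2."""
    return 2 * (1 - v) * d / ((1 - v) + d)


def seed_prominence(freq, sim_to_seed):
    """First alternative: frequency-weighted average similarity to the seed."""
    weighted = sum(sim_to_seed[n] * f for n, f in freq.items())
    return weighted / sum(freq.values())


def size_prominence(num_entities, smallest_clique_size):
    """Second alternative: combines cluster dimension and smallest clique size."""
    n, k = num_entities, smallest_clique_size
    return 2 * n * k / (n + k)


# Toy cluster (hypothetical data): two adjacent 3-cliques sharing "b" and "c".
cliques = [{"a", "b", "c"}, {"b", "c", "d"}]
freq = frequencies(cliques)
links = links_from_cliques(cliques)
v = variability(freq)
d = density(len(links), len(freq))
p = prominence(v, d)
print(round(v, 3), round(d, 3), round(p, 3))
```

For this toy cluster the frequencies are 1, 2, 2, 1 and five links exist among four entities, so the sketch gives $v \approx 0.385$, $d \approx 0.833$, and $P \approx 0.708$. Note that the second alternative is not normalized to $[0, 1]$, so it is suited to ranking clusters rather than to direct comparison with the default prominence values.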

4.3    Proximity relations
In an inCloud, clusters (and consequently their associated essentials) are connected by reciprocal proximity relations, which represent the degree of overlap between them. In particular, given two clusters $Cl_i$ and $Cl_j$, the degree of proximity $X_{ij} = |Cl_i \cap Cl_j| / |Cl_i|$ between $Cl_i$ and $Cl_j$ is the ratio between the number of linked data entities common to $Cl_i$ and $Cl_j$ and the number of linked data entities in $Cl_i$. The greater the overlap between $Cl_i$ and $Cl_j$, the higher the degree of their proximity relation. Proximity relations are graphically represented by arrows with thickness proportional to the proximity degree. In Figure 2, we can see how proximity relations connect those clusters that are more semantically related to each other, such as $Cl_2$, $Cl_4$, and $Cl_5$, which all represent different types of artworks by Vincent van Gogh.

5.    USING INCLOUDS FOR THEMATIC EXPLORATION
In this section, we discuss how inClouds can be exploited for thematic exploration of linked data and we provide some considerations about the applicability of the inCloud approach in the large-scale scenario.

5.1    Thematic exploration through inClouds
An inCloud enables different exploration modalities that can be switched on according to the specific user preferences. In particular, the following modalities are defined.

     • Exploration-by-essential. This is the most intuitive exploration modality and it is based on cluster essentials. A user can consider each essential as a sort of instantaneous picture of the associated cluster and of the linked data contained therein, and can thus rapidly choose the preferred one for starting the exploration.
     • Exploration-by-prominence. This modality allows the user to organize the exploration according to the prominence values associated with the clusters. The idea is to support the user in moving through the clusters according to their relevance with respect to the set of considered linked data. As discussed in Section 4, different criteria can be used to calculate the prominence value. The capability to switch from one criterion to another allows the user to dynamically re-organize the inCloud in light of a different notion of cluster prominence.
     • Exploration-by-proximity. This modality enables the user to choose a cluster and to browse its constellation by exploiting the proximity relations. When a user is exploring a certain cluster, the proximity relations indicate its fully/partially overlapping neighbors, thus suggesting the possible exploration of clusters that are related in content.

5.2    Linked data exploration in-the-large
The presented inCloud approach can also be applied in the large-scale scenario. In particular, it can be extended to multi-repository exploration and to multi-seed extraction.

Extension to multi-repository exploration. For a more complete visualization of the available linked data about a certain search target, multiple RDF repositories can be queried to originate a unique, comprehensive inCloud. In the Linked Data Cloud, the property owl:sameAs is used to denote that a linked data entity $n_i$ belonging to a certain RDF repository $R$ and another entity $n_j$ belonging to a different repository $R'$ refer to the same real-world object. In a multi-repository scenario, the construction of the graph $G_s$ can take into account the owl:sameAs relations as a sort of "natural join" operation. The idea is to start the construction of $G_s$ by querying an initial repository $R$ and to exploit the owl:sameAs relations to extend the linked data extraction to other RDF repositories. In particular, the URIs connected by an owl:sameAs relation are collapsed into a unique linked data entity of $G_s$, and the extraction/filtering operations described in Section 3 are applied to the whole set of linked data extracted from the considered RDF repositories.

Extension to multi-seed extraction. In some cases, the user can be interested in exploring the available linked data about more than one seed of interest. In this framework, the inCloud mechanism can be used to build a comprehensive thematic picture that takes into account all the seeds of interest. In a multi-seed scenario, the starting point is a set of seeds $S = \{s_1, \ldots, s_k\}$. The graph $G_s$ is built by executing the extraction/filtering operations of Section 3 for each element $s_i \in S$. Depending on the seeds of interest, one or more portions of the graph $G_s$ can be disjoint from the rest of the graph. In particular, when the seeds in $S$ are about completely different topics, a separate independent cluster is generated through aggregation for each $s_i \in S$. In such a limit case, the usefulness of the inCloud mechanism for exploration lies in the capability of providing an effective synthetic essential for each seed $s_i \in S$ and in calculating the relative prominence of each seed with respect to the others.

We stress that linked data exploration in-the-large can require the execution of thematic aggregation techniques over a starting RDF graph $G_s$ containing a huge number of nodes (e.g., thousands of linked data entities). The clique percolation method we use for cluster calculation performs best when a small-to-medium number of nodes in the graph $G_s$ is considered (e.g., hundreds of linked data entities). For example, in our tests, the CPM over a graph $G_s$ containing 200 nodes takes an execution time of 200 ms (considering a matching threshold th = 0.9). For linked data exploration in-the-large, when 1,000 (or more) nodes are considered, more efficient clustering algorithms, like hierarchical clustering, can be exploited (see [3] for further details).

6.    RELATED WORK
Problems and solutions more strictly related to our work are focused either on improving search and retrieval of information in the Linked Data cloud [14] or on browsing and presentation of linked data contents [5]. Search and retrieval is moving from traditional information lookup to exploratory search, defined as the activity of finding and understanding knowledge about a topic of interest by exploiting aggregation and learning of information in a social context [11]. In this respect, for example, Sig.ma (Semantic Information MAshup) [16] retrieves and integrates linked data, starting from a single URI, by querying the Web of Data and applying machine learning to the data found. In a similar direction, structured and collaborative search engines are emerging as a promising solution for presenting the query results in a structured form and for focusing on the understanding of the user information need. Examples in this field are Wolfram Alpha (http://www.wolframalpha.com), Google Wonder Wheel (http://www.googlewonderwheel.com), and YAGO2 (http://www.mpi-inf.mpg.de/yago-naga/yago). Another category of related work includes approaches aiming at presenting linked data in a more intuitive way. Examples of solutions in this respect are [8, 12] and Freebase Parallax (http://www.freebase.com/labs/parallax/), where tools that help users in exploring DBpedia and Freebase are presented, not only via directed links in the RDF dataset, but also via newly discovered knowledge associations and visual navigation. These tools exploit aggregation techniques in order to combine related topics in unified nodes, also providing a textual description of each node. In other approaches, like Marbles (http://www5.wiwiss.fu-berlin.de/marbles) and LESS (http://less.aksw.org), information about resources of interest is presented by exploiting HTML and RSS and by using different colors to distinguish sources.

With respect to the related work, our contribution regards the use of data similarity, proximity, and prominence techniques for inCloud construction, to move from a basic, flat organization of linked data to a high-level, thematic view of them. Moreover, the proposed techniques allow the different themes/topics to directly emerge from the original linked data and their mutual links, also suggesting an intuitive visualization of data contents in terms of essentials, which synthesize the contents of thematic clusters.

7.    CONCLUDING REMARKS
In this paper, we presented inClouds, high-level views of linked data enabling their thematic exploration. Ongoing work is focused on finalizing the development of a web application fully covering the steps of linked data aggregation and abstraction required for inCloud construction. By exploiting an initial prototype implementation, we ran some experiments concerning user evaluation of inClouds based on standard user-oriented evaluation methods for interactive web search interfaces and systems [10]. Initial results are promising and inClouds are seen by real users as a valid support to the satisfaction of users' information needs [3]. Moreover, ongoing research activity regards the extension of the inCloud approach to consider additional kinds of web data contents, like microdata, microblogging posts, and news. The idea is to propose inClouds as a comprehensive exploration tool that also considers current, up-to-date social web information about the search target, for possible use in the framework of event-promoting applications.

8.    REFERENCES
 [1] C. Bizer, T. Heath, and T. Berners-Lee. Linked Data - The Story So Far. Int. Journal on Semantic Web and Information Systems, 5(3):1–22, 2009.
 [2] S. Castano, A. Ferrara, and S. Montanelli. Matching Ontologies in Open Networked Systems: Techniques and Applications. Journal on Data Semantics, V:25–63, 2006.
 [3] S. Castano, A. Ferrara, and S. Montanelli. Structured Data Clouding across Multiple Webs. Technical report, Università degli Studi di Milano, 2011.
 [4] S. Castano and G. Varese. Next Generation Data Technologies for Collective Computational Intelligence, chapter Building Collective Intelligence through Folksonomy Coordination, pages 87–112. Springer, 2011.
 [5] S. Davies, J. Hatfield, C. Donaher, and J. Zeitz. User Interface Design Considerations for Linked Data Authoring Environments. In Proc. of the WWW Int. Workshop on Linked Data on the Web (LDOW 2010), Raleigh, NC, USA, 2010.
 [6] B. Everitt. Cluster Analysis. Edward Arnold, London, UK, 3rd edition, 1993.
 [7] W. Halb, Y. Raimond, and M. Hausenblas. Building Linked Data for both Humans and Machines. In Proc. of the WWW Int. Workshop on Linked Data on the Web (LDOW 2008), Beijing, China, 2008.
 [8] C. Hirsch et al. Interactive Visualization Tools for Exploring the Semantic Graph of Large Knowledge Spaces. In Proc. of the IUI Int. Workshop on Visual Interfaces to the Social and the Semantic Web, Sanibel Island, USA, 2009.
 [9] A. Hogan, A. Harth, A. Passant, S. Decker, and A. Polleres. Weaving the Pedantic Web. In Proc. of the WWW Int. Workshop on Linked Data on the Web (LDOW 2010), Raleigh, NC, USA, 2010.
[10] A. Leclercq. The perceptual evaluation of information systems using the construct of user satisfaction: case study of a large French group. ACM SIGMIS Database, 38(2):27–60, 2007.
[11] G. Marchionini. Exploratory Search: from Finding to Understanding. Communications of the ACM, 49(4):41–46, 2006.
[12] R. Mirizzi, A. Ragone, T. Di Noia, and E. Di Sciascio. Semantic Wonder Cloud: Exploratory Search in DBpedia. In Proc. of the ICWE 2nd Int. Workshop on Semantic Web Information Management (SWIM 2010), pages 138–149, Vienna, Austria, 2010.
[13] G. Palla, I. Derényi, I. Farkas, and T. Vicsek. Uncovering the Overlapping Community Structure of Complex Networks in Nature and Society. Nature, 435:814–818, 2005.
[14] D. Petrelli, S. Mazumdar, A. Dadzie, and F. Ciravegna. Multi Visualization and Dynamic Query for Effective Exploration of Semantic Data. In Proc. of the 8th Int. Semantic Web Conference, pages 505–520, Chantilly, VA, USA, 2009.
[15] S. Sorrentino et al. Schema Normalization for Improving Schema Matching. In Proc. of the 28th Int. ER Conference, pages 280–293, Gramado, Brazil, 2009.
[16] G. Tummarello et al. Sig.ma: Live Views on the Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web, 8(4):355–364, 2010.