=Paper=
{{Paper
|id=None
|storemode=property
|title=Thematic Exploration of Linked Data
|pdfUrl=https://ceur-ws.org/Vol-880/VLDS-p11-Castano.pdf
|volume=Vol-880
|dblpUrl=https://dblp.org/rec/conf/vlds/CastanoFM11
}}
==Thematic Exploration of Linked Data==
Silvana Castano, Alfio Ferrara, Stefano Montanelli

Università degli Studi di Milano, DICo - Via Comelico, 39 - 20135 Milano, Italy

silvana.castano@unimi.it, alfio.ferrara@unimi.it, stefano.montanelli@unimi.it

===ABSTRACT===
Now that a huge amount of data is available in the Linked Data Cloud, providing techniques for its effective exploration is becoming more and more important. In this paper, we propose aggregation and abstraction techniques for thematic exploration of linked data. These techniques transform a basic, flat view of a potentially large set of messy linked data for a given search target into a high-level, thematic view called inCloud. In an inCloud, thematic exploration is guided by a few essentials that auto-describe their prominence for the search target and by their reciprocal proximity relations.

===Categories and Subject Descriptors===
H.5 [Information Systems]: Information Interfaces and Presentation; H.3 [Information Systems]: Information Storage and Retrieval

===General Terms===
Linked data aggregation, labeling, and exploration

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. This article was presented at: Very Large Data Search (VLDS) 2011. Copyright 2011.

===1. INTRODUCTION===
The Linked Data paradigm promoted a new way of exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web, based on URIs (Uniform Resource Identifiers) and RDF (Resource Description Framework) [1]. Now that a huge amount of data is available in the Linked Data Cloud, providing techniques for effective linked data searching, exploration, and visualization is becoming crucial [7, 9]. In the recent literature, issues related to linked data exploration are gaining more and more importance [8, 12]. One of the most challenging questions is to provide effective browsing solutions capable of dealing with the inherent flat organization of linked data and of managing the existing huge-sized repositories storing millions of RDF triples.

In this context, we propose abstraction and aggregation techniques to transform a basic, flat view of a potentially large set of messy linked data into an inCloud, that is, a high-level, thematic view enabling a more effective, theme-driven exploration of the same dataset. Through aggregation techniques, we identify clusters of semantically related linked data in a (even large) collection representing the response to a search target. Through abstraction techniques, we mine suitable essentials capturing the theme dealt with by a linked data cluster and its relevance for the search target, as well as proximity relations reflecting the reciprocal degree of closeness between cluster essentials. We motivate the role of inClouds through a real example of a linked data collection extracted from the Freebase repository considering Van Gogh as search target. Moreover, we describe the construction of an inCloud through aggregation and abstraction techniques. Finally, we show how the inCloud representation can be used for thematic browsing and exploration of the underlying linked data collection.

===2. MOTIVATING EXAMPLE===
In a common scenario, the user interested in exploring a linked data repository to satisfy a certain search target usually has to face a long and not very intuitive browsing activity. This is due to the inherent flat organization of linked data repositories, where the URIs of interest for a given target frequently require the user to follow more than one property link before being explored. In particular, the user exploration is typically characterized by the following steps:

• Submission to the repository of a search target (t), namely a keyword (or a list of keywords) that describes the subject of interest for the search. An example of search target is the name of the famous painter Vincent van Gogh.

• Selection of the seed of interest (s), namely a URI that represents the "point of origin" for the exploration about the search target. The seed of interest is chosen from the list of URIs returned by the repository as a reply to the search target. In the Freebase linked data repository (footnote 1), an example of seed for our target is the URI /en/vincent_van_gogh.

• Exploration of the URIs reachable from the seed with the aim of getting access to more information about the search target. This requires the user to submit appropriate queries to the repository to extract the seed properties and the URIs directly linked to s through these properties. An example of MQL query for the Freebase repository to extract the artworks directly connected to the seed s = /en/vincent_van_gogh through the property /visual_art/visual_artist/artworks is [{ "id": "/en/vincent_van_gogh", "type": "/visual_art/visual_artist", "/visual_art/visual_artist/artworks": {} }].

Footnote 1: http://www.freebase.com/.

The exploration step can be recursively applied to the visited URIs to progressively discover further URIs at higher distance from the seed according to the user choices and interests.

Due to the huge number of linked data that is usually concerned with a search target, a lot of exploration steps are required to build a (more or less) comprehensive picture of the available information about the target. As an example, in Figure 1, we show the set of linked data extracted from Freebase for the target Vincent van Gogh. In this example, we considered the seed s = /en/vincent_van_gogh and we explored the complete set of directly linked URIs and some selected URIs at distance d = 2 from s. As is clear from this simple example, exploring such a flat and huge collection of data is cumbersome. First, the representation is flat and it is impossible to immediately understand whether some URIs are more important than others. Moreover, possible sets of URIs addressing the same/similar argument about the target are neither highlighted nor grouped.

Figure 1: A graph of linked data extracted from the Freebase repository about the search target Vincent van Gogh

The solution we propose is based on aggregation and abstraction techniques to transform a basic, flat view of linked data like the one in Figure 1 into an inCloud providing a high-level, thematic view of the same data. inClouds are conceived to be coupled with the conventional query interfaces of the existing Linked Data repositories, in that they can be built on top of an extracted dataset to provide a more effective presentation of the result.

An example of inCloud for the seed s = /en/vincent_van_gogh is shown in Figure 2. In an inCloud:

• a circle-box represents a cluster, namely a group of linked data focused on a specific argument/topic related to the considered seed (e.g., the set of artists that influenced or have been influenced by van Gogh (cluster Cl3) or the set of van Gogh artworks about sunflowers (cluster Cl4));

• a square-box represents an essential, namely a concise and convenient summary of the content of a cluster at a glance (e.g., Topic Artwork, Sunflower used to summarize the content of cluster Cl4). Clusters in an inCloud are also characterized by a prominence value denoting their level of importance for the target in the framework of the overall inCloud. Prominence values determine the size of the cluster circles, thus the most prominent cluster in the inCloud of Figure 2 is cluster Cl1;

• an arrow represents a proximity relation between clusters/essentials, namely a closeness relationship between the themes/topics they represent. The arrow thickness denotes the degree of proximity between the two clusters/essentials connected by the arrow.

Aggregation techniques are first employed to enforce a "thematic" clustering of the initial set of linked data, as described in Section 3. Abstraction techniques are then applied to synthesize an inCloud over the thematic clusters, as described in Section 4.

===3. LINKED DATA AGGREGATION===
The goal of aggregation techniques is to transform an initial set of linked data into a number of thematic clusters. The starting point is an RDF graph G_s containing the linked data about a certain seed s of interest automatically extracted from a Linked Data repository R. Appropriate extraction queries are defined to this end according to the language (e.g., SPARQL, MQL) supported by the repository R. These queries generally enforce the following extraction/filtering operations:

• Extraction of properties and corresponding values within a distance ≤ d from the seed s. We consider that a URI in the repository R is concerned with the seed s if there is a property path of length ≤ d between the URI and s. The distance d can be dynamically changed and it has an impact on the number of extracted linked data and thus on the size of the resulting RDF graph. In usual scenarios, a distance d = 2 is a good trade-off to obtain a sufficient number of linked data about s and a well-sized RDF graph.

• Extraction of the URI types. For each URI within a distance ≤ d from the seed s, we extract the list of types (i.e., classes) the URI belongs to. The appropriate property of the repository R is exploited to this end (e.g., the property type in Freebase).

• Filtering of non-relevant properties. Loosely meaningful properties of a repository, like the property image of Freebase, can be excluded from the resulting RDF graph since they are poorly useful in providing information about s.
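The distance-bounded extraction and property filtering listed above can be illustrated with a breadth-first traversal. This is only a minimal sketch, assuming the linked data is already available as in-memory (subject, property, object) triples rather than fetched through SPARQL or MQL queries. The function name `extract_subgraph`, the toy triples, and the property names used in them are illustrative assumptions and are not taken from the original approach.

```python
from collections import deque


def extract_subgraph(triples, seed, d=2, excluded=frozenset()):
    """Keep the triples lying on property paths of length <= d from the seed.

    triples:  iterable of (subject, property, object) tuples (assumed input
              format); links are followed in both directions.
    excluded: property names to filter out as non-relevant.
    Returns (kept_triples, distance) where distance maps node -> hops from seed.
    """
    # Index every non-excluded triple by both of its endpoints.
    adjacency = {}
    for s, p, o in triples:
        if p in excluded:
            continue
        adjacency.setdefault(s, []).append((s, p, o))
        adjacency.setdefault(o, []).append((s, p, o))

    distance = {seed: 0}
    queue = deque([seed])
    kept = set()
    while queue:
        node = queue.popleft()
        if distance[node] == d:
            continue  # nodes at the distance bound are not expanded further
        for s, p, o in adjacency.get(node, []):
            kept.add((s, p, o))
            other = o if s == node else s
            if other not in distance:
                distance[other] = distance[node] + 1
                queue.append(other)
    return kept, distance


# Hypothetical toy data: the property names below are made up for illustration.
seed = "/en/vincent_van_gogh"
triples = [
    (seed, "artworks", "/en/the_starry_night"),
    (seed, "image", "/img/portrait"),
    ("/en/the_starry_night", "subject", "/en/night"),
    ("/en/night", "related", "/en/far_away"),
]
kept, distance = extract_subgraph(triples, seed, d=2, excluded={"image"})
```

With d = 2, the node `/en/far_away` lies three hops from the seed and is therefore left out, and the `image` triple is dropped by the filter. Raising d enlarges the extracted graph, which mirrors the trade-off discussed in the first bullet.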
Figure 2: An example of inCloud extracted from the Freebase repository for the seed /en/vincent_van_gogh (legend: essential, thematic cluster, proximity link; clusters Cl1 to Cl5)

The query result is the graph G_s = (N_s, E_s) where a node n ∈ N_s, called linked data entity, can be a URI, a literal, or a type value that satisfies the query selection, and an edge e(n_i, n_j) ∈ E_s, called property link, represents a property relationship of R between the nodes n_i, n_j ∈ N_s.

Based on the RDF graph G_s, linked data aggregation is articulated in two main steps, namely similarity evaluation and thematic clustering.

====3.1 Similarity evaluation====
This step has the goal of analyzing the graph G_s and generating an augmented linked data graph G_s^+ where a similarity link is added between each pair of matching linked data entities in N_s. To this end, the level of affinity between the entities of N_s is evaluated as follows. Given two linked data entities n_i, n_j ∈ N_s, the linked data affinity σ(n_i, n_j) ∈ [0, 1] denotes the level of similarity of n_i and n_j based on the commonalities of their terminological equipments. Each linked data entity n ∈ N_s is associated with a terminological equipment Term_n = {term_1, ..., term_m} where term_j, with 1 ≤ j ≤ m, is a term appearing in the label of a node adjacent to n in G_s, or a term appearing in the label of n itself. Before inclusion in a terminological equipment, each term is submitted to a normalization procedure for word-lemma extraction and for compound-term tokenization [4, 15].

The affinity σ of two linked data entities n_i, n_j ∈ N_s is calculated as the Dice coefficient over their terminological equipments as follows:

<math>\sigma(n_i, n_j) = \frac{2 \cdot |\,term_x \sim term_y\,|}{|Term_{n_i}| + |Term_{n_j}|}</math>

where term_x ∼ term_y denotes that term_x ∈ Term_{n_i} and term_y ∈ Term_{n_j} are matching terms according to a string matching metric that considers the structure of the terms term_x and term_y. For σ calculation, we employ our matching system HMatch 2.0, where state-of-the-art metrics for string matching (e.g., I-Sub, Q-Gram, Edit-Distance, and Jaro-Winkler) are implemented [2]. A similarity link e(n_i, n_j) is established between the linked data entities n_i and n_j iff σ(n_i, n_j) ≥ th, where th ∈ (0, 1] is a matching threshold denoting the minimum level of similarity required to consider two linked data entities as matching entities.

====3.2 Thematic aggregation====
This step has the goal of analyzing the graph G_s^+ obtained through similarity evaluation and identifying/mining a set CL of thematic clusters. Given a graph G_s^+, a cluster Cl = {(n_1, f_1), ..., (n_h, f_h)} is a set of linked data entities n_1, ..., n_h ∈ N_s that are more similar to each other than to the other entities of N_s. Each entity n_j belonging to Cl is associated with a corresponding frequency f_j which denotes the number of occurrences of n_j in Cl.

Clusters are determined by exploiting the graph G_s^+ and by detecting those node regions that are highly interconnected through property/similarity links. The problem of thematic aggregation is analogous to the problem of cluster calculation, also known as module, community, or cohesive group, in graph theory. For this reason, for thematic aggregation, we rely on a clique percolation method (CPM) [13]. The CPM is based on the notion of k-clique, which corresponds to a complete (fully-connected) sub-graph of k nodes within the graph G_s^+. Two k-cliques are defined as adjacent k-cliques if they share k − 1 nodes. The CPM determines clusters from k-cliques. In particular, a cluster, or more precisely, a k-clique-cluster, is defined as the union of all k-cliques that can be reached from each other through a series of adjacent k-cliques. As a consequence, a typical k-clique-cluster is composed of several cliques (with size ≤ k) that tend to share many of their nodes. Since the cliques of a graph can share one or more nodes, we observe that a node can belong to several clusters, and thus clusters can overlap. In our approach, we employ the CPM implemented in the CFinder tool (footnote 2). Although the determination of the full set of cliques of a graph is widely believed to be a non-polynomial problem, CFinder proves to be efficient when applied to graphs like those considered in our approach. Such an algorithm is based on first locating all complete subgraphs of G_s^+ that are not part of larger complete subgraphs, and then on identifying existing k-clique-clusters by carrying out a standard component analysis of the clique-clique overlap matrix [6]. As a result, CFinder produces the full set CL of k-clique-clusters existing in the graph G_s^+ for all the possible values of k. A linked data entity n_i belonging to a cluster Cl ∈ CL is represented as a pair (n_i, f_i) where the frequency value f_i denotes the number of cliques of Cl which the entity n_i belongs to (see the example of Figure 2). The entities of a cluster are represented with different sizes, proportional to the corresponding frequency values according to a visualization manner "à la tag-cloud" (footnote 3).

Footnote 2: Available at http://www.cfinder.org/.

Footnote 3: For a more readable visualization of highly-populated clusters, the representation of less-frequent linked data entities can be omitted.

===4. LINKED DATA ABSTRACTION===
The goal of linked data abstraction techniques is to build an inCloud, namely a high-level view on top of linked data clusters by synthesizing them through essentials. inCloud clusters are also featured by a level of prominence and by proximity relations that denote the level of overlapping of the different clusters.

====4.1 Essential abstraction====
An essential Ess_i is a concise and convenient summary of a thematic cluster Cl_i and it is defined as a pair of the form Ess_i = (C_i, D_i) where C_i is the category associated with Cl_i and D_i is a descriptor associated with Cl_i. A category C_i is a set composed of the labels of the most frequent types of the linked data entities in Cl_i, while a descriptor D_i is a set composed of the most frequent terms in the terminological equipments of the entities in Cl_i. If several types and/or terms are equally most frequent, they are all inserted in C_i and D_i, respectively. In the example of Figure 2, the cluster Cl4 corresponds to a very focused theme expressed by the essential category Topic Artwork (the most frequent type of the entities in the cluster) and by the essential descriptor Sunflower (the most frequent term in the terminological equipments of the entities in Cl4). In cases where many entities are equally frequent in a cluster, the abstracted essential is less focused and contains more terms. This is the case, for example, of the cluster Cl3 of Figure 2, representing persons and visual artists influenced by Van Gogh. In this case, the most frequent terms used as descriptors are the names of the people involved in the cluster, which are all equally frequent in the cluster.

====4.2 Prominence evaluation====
Clusters (and related essentials) in an inCloud are differently relevant with respect to the original search target. In order to represent this fact, we introduce the notion of prominence of a cluster, namely a value P_i ∈ [0, 1]. The higher P_i is, the higher is also the prominence of Cl_i in the inCloud. In our approach, the level of prominence of a cluster is higher when the cluster is very focused on its theme and its contents are homogeneous. In particular, we formalize two cluster properties, namely variability and density.

Variability v_i is the degree of overlap among the cliques of the cluster Cl_i. For a linked data entity n_j ∈ N_s^+, we call f_j the frequency of n_j, that is, the number of cliques of Cl_i that contain n_j. Variability v_i is measured by a coefficient of variation, which is the ratio between the standard deviation of the linked data entity frequencies in Cl_i and the arithmetic mean of those frequencies, as follows (with <math>\bar{f}</math> denoting the arithmetic mean value of frequencies and N_i the number of entities in Cl_i):

<math>v_i = \frac{1}{\bar{f}} \sqrt{\frac{1}{N_i - 1} \sum_{j=1}^{N_i} (f_j - \bar{f})^2}</math>

According to this definition, high values of v_i denote a low degree of overlap in the cliques of the cluster Cl_i, while low values of v_i denote a high degree of overlap in the Cl_i cliques.

Density d_i of a cluster Cl_i is the degree of interconnection among the linked data entities of Cl_i. The density coefficient d_i = 2 · R_i / (N_i · (N_i − 1)) is the ratio between the number R_i of links in the cluster Cl_i and the maximum number of possible links. According to this definition, high values of d_i denote a high degree of interconnection among the cluster Cl_i entities, while low values of d_i denote a low degree of interconnection. The prominence P_i of a cluster Cl_i is calculated on the basis of its variability and density as follows:

<math>P_i = \frac{2 \cdot (1 - v_i) \cdot d_i}{(1 - v_i) + d_i}</math>

According to this approach, the most prominent clusters are those which are more focused and homogeneous with respect to their theme. We graphically represent cluster prominence by drawing circles proportional to the prominence values of the corresponding clusters. In our example of Figure 2, clusters Cl1 and Cl4 are more prominent (larger circles) because they are more focused and homogeneous. Conversely, clusters like Cl3, which collect several entities of different types, are considered less prominent (smaller circle). However, other options are possible for the evaluation of prominence in case of specific application needs. A first option is to consider a cluster to be more prominent as it is closer to the seed s of interest. In this case, the prominence P_i of a cluster Cl_i is evaluated by taking into account the average value of similarity between the linked data entities in the cluster Cl_i and s, weighted by the frequency of each entity in Cl_i, as follows:

<math>P_i = \frac{\sum_{p=1}^{N_i} \sigma(n_p, s) \cdot f_p}{\sum_{p=1}^{N_i} f_p}</math>

where f_p denotes the frequency of the linked data entity n_p in the cluster Cl_i. Another option is to consider the prominence P_i of a cluster Cl_i as proportional to the dimension N_i of Cl_i and to the size k_i of the smallest clique in Cl_i, as follows:

<math>P_i = \frac{2 \cdot N_i \cdot k_i}{N_i + k_i}</math>

====4.3 Proximity relations====
In an inCloud, clusters (and consequently their associated essentials) are connected by reciprocal proximity relations, which represent the degree of overlapping between them. In particular, given two clusters Cl_i and Cl_j, the degree of proximity X_ij = |Cl_i ∩ Cl_j| / |Cl_i| between Cl_i and Cl_j is proportional to the number of linked data entities common to Cl_i and Cl_j over the number of linked data entities in Cl_i. The greater the level of overlapping between Cl_i and Cl_j, the higher the degree of their proximity relation. Proximity relations are graphically represented by arrows with thickness proportional to the proximity degree. In Figure 2, we can see how proximity relations connect those clusters that are more semantically related to each other, such as Cl2, Cl4, and Cl5, which all represent different types of artworks by Vincent van Gogh.

===5. USING INCLOUDS FOR THEMATIC EXPLORATION===
In this section, we discuss how inClouds can be exploited for thematic exploration of linked data and we provide some considerations about the applicability of the inCloud approach in the large-scale scenario.

====5.1 Thematic exploration through inClouds====
An inCloud enables different exploration modalities that can be switched on according to the specific user preferences. In particular, the following modalities are defined.

• Exploration-by-essential. This is the most intuitive exploration modality and it is based on cluster essentials. A user can consider each essential as a sort of instantaneous picture of the associated cluster and of the linked data contained therein, and can thus rapidly choose the preferred one for starting the exploration.

• Exploration-by-prominence. This modality allows the user to organize the exploration according to the prominence values associated with the clusters. The idea is to support the user in moving throughout the clusters according to their relevance with respect to the set of considered linked data. As discussed in Section 4, different criteria can be used to calculate the prominence value. The capability to switch from one criterion to another allows the user to dynamically re-organize the inCloud in light of a different notion of cluster prominence.

• Exploration-by-proximity. This modality enables the user to choose a cluster and to browse its constellation by exploiting the proximity relations. When a user is exploring a certain cluster, the proximity relations provide indication of its fully/partially overlapping neighbors, thus suggesting the possible exploration of clusters that are somehow related in content.

====5.2 Linked data exploration in-the-large====
The presented inCloud approach can also be applied in the large-scale scenario. In particular, extension to the multi-repository exploration and to the multi-seed extraction can be performed.

Extension to multi-repository exploration. For a more complete visualization of the available linked data about a certain search target, multiple RDF repositories can be queried to originate a unique, comprehensive inCloud. In the Linked Data Cloud, the property owl:sameAs is used to denote when a linked data entity n_i belonging to a certain RDF repository R and another entity n_j belonging to a different repository R′ refer to the same real-world object. In a multi-repository scenario, the construction of the graph G_s can take into account the owl:sameAs relations as a sort of "natural join" operation. The idea is to start the construction of G_s by querying an initial repository R and to exploit the owl:sameAs relations to extend the linked data extraction to other RDF repositories. In particular, the URIs connected by an owl:sameAs relation are collapsed into a unique linked data entity of G_s and the extraction/filtering operations described in Section 3 are applied to the whole set of linked data extracted from the considered RDF repositories.

Extension to multi-seed extraction. In some cases, the user can be interested in exploring the available linked data about more than one seed of interest. In this framework, the inCloud mechanism can be used to build a comprehensive thematic picture that takes into account all the seeds of interest. In a multi-seed scenario, the starting point is a set of seeds S = {s_1, ..., s_k}. The graph G_s is built by executing the extraction/filtering operations of Section 3 for each element s_i ∈ S. Depending on the seeds of interest, one or more portions of the graph G_s can be disjoint from the rest of the graph. In particular, when the seeds in S are about completely different arguments, a separate independent cluster is generated through aggregation for each s_i ∈ S. In such a limit case, the usefulness of the inCloud mechanism for exploration is in the capability of providing an effective synthetic essential for each seed s_i ∈ S and in calculating the relative prominence of each seed with respect to the others.

We stress that linked data exploration in-the-large can require the execution of thematic aggregation techniques over a starting RDF graph G_s containing a huge number of nodes (e.g., thousands of linked data entities). The clique percolation method we use for cluster calculation performs best when a small-medium number of nodes in the graph G_s is considered (e.g., hundreds of linked data entities). For example, in our tests, the CPM over a graph G_s containing 200 nodes takes an execution time of 200 ms (considering a matching threshold th = 0.9). For linked data exploration in-the-large, when 1,000 (or more) nodes are considered, more efficient clustering algorithms, like hierarchical clustering, can be exploited (see [3] for further details).

===6. RELATED WORK===
Problems and solutions more strictly related to our work are focused either on improving search and retrieval of information in the Linked Data cloud [14] or on browsing and presentation of linked data contents [5]. Search and retrieval is moving from traditional information lookup to exploratory search, defined as the activity of finding and understanding knowledge about a topic of interest by exploiting aggregation and learning of information in a social context [11]. In this respect, for example, Sig.ma (Semantic Information MAshup) [16] retrieves and integrates linked data, starting from a single URI, by querying the Web of Data and applying machine learning to the data found. In a similar direction, structured and collaborative search engines are emerging as a promising solution for presenting the query results in a sort of structured form and focusing on the understanding of the user information need. Examples in this field are Wolfram Alpha (http://www.wolframalpha.com), Google Wonder Wheel (http://www.googlewonderwheel.com), and YAGO2 (http://www.mpi-inf.mpg.de/yago-naga/yago). Another category of related work includes approaches aiming at presenting linked data in a more intuitive way. Examples of solutions in this respect are [8, 12] and Freebase Parallax (http://www.freebase.com/labs/parallax/), where tools that help users in exploring DBpedia and Freebase are presented, not only via directed links in the RDF dataset, but also via newly discovered knowledge associations and visual navigation. These tools exploit aggregation techniques in order to combine related topics in unified nodes, providing also a textual description of each node. In other approaches, like Marbles (http://www5.wiwiss.fu-berlin.de/marbles) and LESS (http://less.aksw.org), information about resources of interest is presented exploiting HTML and RSS and by using different colors to distinguish sources.

With respect to the related work, our contribution regards the use of data similarity, proximity, and prominence techniques for inCloud construction, to move from a basic, flat organization of linked data to a high-level, thematic view of them. Moreover, the proposed techniques allow the different themes/topics to directly emerge from the original linked data and their mutual links, by suggesting also an intuitive

[2] S. Castano, A. Ferrara, and S. Montanelli. Matching Ontologies in Open Networked Systems: Techniques and Applications. Journal on Data Semantics, V:25–63, 2006.

[3] S. Castano, A. Ferrara, and S. Montanelli. Structured Data Clouding across Multiple Webs. Technical report, Università degli Studi di Milano, 2011.

[4] S. Castano and G. Varese. Next Generation Data Technologies for Collective Computational Intelligence, chapter Building Collective Intelligence through Folksonomy Coordination, pages 87–112. Springer, 2011.

[5] S. Davies, J. Hatfield, C. Donaher, and J. Zeitz. User Interface Design Considerations for Linked Data Authoring Environments. In Proc. of the WWW Int. Workshop on Linked Data on the Web (LDOW 2010), Raleigh, NC, USA, 2010.

[6] B. Everitt. Cluster Analysis. Edward Arnold, London, UK, 3rd edition, 1993.

[7] W. Halb, Y. Raimond, and M. Hausenblas. Building Linked Data for both Humans and Machines. In Proc. of the WWW Int. Workshop on Linked Data on the Web (LDOW 2008), Beijing, China, 2008.

[8] C. Hirsch et al. Interactive Visualization Tools for Exploring the Semantic Graph of Large Knowledge Spaces. In Proc. of the IUI Int. Workshop on Visual Interfaces to the Social and the Semantic Web, Sanibel Island, USA, 2009.

[9] A. Hogan, A. Harth, A. Passant, S. Decker, and A. Polleres. Weaving the Pedantic Web. In Proc. of the WWW Int. Workshop on Linked Data on the Web (LDOW 2010), Raleigh, NC, USA, 2010.

[10] A. Leclercq.
he perceptual evaluation of information visualization of data contents in terms of essentials, which systems using the construct of user satisfaction: case synthesize the contents of thematic clusters. study of a large french group. ACM SIGMIS Database, 38(2):27–60, 2007. [11] G. Marchionini. Exploratory Search: from Finding to 7. CONCLUDING REMARKS Understanding. Communications of the ACM, In this paper, we presented inClouds, high-level views of 49(4):41–46, 2006. linked data enabling their thematic exploration. Ongoing [12] R. Mirizzi, A. Ragone, T. Di Noia, and E. Di Sciascio. work is focused on finalizing the development of a web ap- Semantic Wonder Cloud: Exploratory Search in plication fully covering the steps of linked data aggregation DBpedia. In Proc. of the ICWE 2nd Int. Workshop on and abstraction required for inCloud construction. By ex- Semantic Web Information Management (SWIM ploiting an initial prototype implementation, we run some 2010), pages 138–149, Vienna, Austria, 2010. experiments concerning user evaluation of inClouds based [13] G. Palla, I. Derényi, I. Farkas, and T. Vicsek. on standard user-oriented evaluation methods for interac- Uncovering the Overlapping Community Structure of tive web search interfaces and systems [10]. Initial results Complex Networks in Nature and Society. Nature, are promising and inClouds are seen by real users as a valid 435:814–818, 2005. support to the satisfaction of users information needs [3]. [14] D. Petrelli, S. Mazumdar, A. Dadzie, and Moreover, ongoing research activity regards the extension F. Ciravegna. Multi Visualization and Dynamic Query of the inCloud approach to consider additional kinds of for Effective Exploration of Semantic Data. In Proc. of web data contents, like microdata, microblogging posts, and the 8th Int. Semantic Web Conference, pages 505–520, news. The idea is to propose inClouds as a comprehensive Chantilly, VA, USA, 2009. 
exploration tool considering also actual, up-to-date social web information about the search target for possible fruition [15] S. Sorrentino et al. Schema Normalization for in the framework of event-promoting applications. Improving Schema Matching. In Proc. of the 28th Int. ER Conference, pages 280–293, Gramado, Brazil, 2009. 8. REFERENCES [1] C. Bizer, T. Heath, and T. Berners-Lee. Linked Data - [16] G. Tummarello et al. Sig. ma: Live Views on the Web The Story So Far. Int. Journal on Semantic Web and of Data. Web Semantics: Science, Services and Agents Information Systems, 5(3):1–22, 2009. on the World Wide Web, 8(4):355–364, 2010.
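
As a supplementary illustration of the aggregation step discussed in Section 5.2, the following sketch applies a plain clique percolation procedure [13] to a toy multi-seed graph. It is not the prototype implementation. The node names, the edge list, the helper names build_graph, maximal_cliques, and k_clique_communities, and the choice k = 3 are all assumptions made for illustration, and each edge stands in for a pair of linked data whose matching value exceeds the threshold th.

```python
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from edge pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximal_cliques(adj):
    """Yield all maximal cliques (Bron-Kerbosch with pivoting)."""
    def expand(r, p, x):
        if not p and not x:
            yield frozenset(r)
            return
        # Pivot on the vertex with most neighbors in p to prune branches.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    yield from expand(set(), set(adj), set())


def k_clique_communities(adj, k):
    """Return (possibly overlapping) clusters as unions of adjacent k-cliques."""
    cliques = [c for c in maximal_cliques(adj) if len(c) >= k]
    parent = list(range(len(cliques)))  # union-find over clique indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Two cliques percolate into the same cluster if they share >= k-1 nodes.
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) >= k - 1:
            parent[find(i)] = find(j)

    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical two-seed example: "vg" and "mz" are the seeds; the portion
# of the graph around "mz" is disjoint from the portion around "vg".
edges = [
    ("vg", "p1"), ("vg", "p2"), ("p1", "p2"), ("p1", "p3"), ("p2", "p3"),
    ("vg", "m1"), ("vg", "m2"), ("m1", "m2"),
    ("mz", "o1"), ("mz", "o2"), ("o1", "o2"),
]
clusters = k_clique_communities(build_graph(edges), 3)
print(clusters)
```

In this toy graph the seed "vg" belongs to two clusters, which corresponds to the partial overlap that the proximity relations are meant to capture, while the disjoint seed "mz" yields its own independent cluster. The pairwise comparison of cliques is quadratic in their number, which is one reason why such a procedure is suited to graphs of moderate size only.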