MI-Search: a Smart Approach for Urban Information Clouding1 Stefano Montanelli and Silvana Castano Università degli Studi di Milano DICo - Via Comelico, 39 - 20135 Milano {stefano.montanelli,silvana.castano}@unimi.it Abstract. In this paper, we present an approach for urban-centered and calendar-oriented surfing of web contents according to the personal preferences of a user profile and interests. Real examples related to the city of Milan are discussed in the paper to illustrate the technical peculiarities of the proposed approach. Keywords: Urban Information Clouding, Web Content Classification 1 Introduction The recent innovations in the field of Web 2.0 and Semantic Web have radically changed the way that web contents are surfed and explored. On one side, the growing availability of user-generated contents, like microblogging posts and RSS news, typical of Web 2.0 and Social Web platforms, has posed the question of how to effectively handle and index this huge amount of short and rapidly-obsolescent data [5]. On the other side, the success of the Linked Data paradigm is enforcing the upcoming data-oriented vision of the Semantic Web, in spite of the conventional resource-oriented model [1]. The result is that the existing techniques for web content classification, search, and presentation are actually inadequate to satisfy the user needs in such a pervasive and highly-dynamic scenario. In this paper, we present an approach for urban-centered and calendar-oriented surfing of web contents according to the personal preferences of a user profile and interests. Such an approach has been developed in the framework of the MI-Search project co-funded by Regione Lombardia and Fastweb S.p.A.. A distinguishing feature of MI-Search is the capability to go beyond the actual interoperability problems concerned with the capability to exploit traditional web sites and spontaneous user comments/posts in an integrated way. This allows to enable a cloud-based web exploration, where all the available information about a topic/event of interest are delivered to the user in a comprehensive, intuitive picture. By urban- centered, we mean the MI-Search is tailored to work on the specific scenario of a 1 This work is funded by Regione Lombardia and Fastweb S.p.A. in the framework of the Dote in Ricerca project. 18 selected city, such as the city of Milan in our case, with the goal to focus the web contents to consider on a selected target. By calendar-oriented, we mean that the events and meetings noted in the personal agenda can be exploited to automatically select the web contents that can be suggested as potentially interesting for the final user. Real examples related to the city of Milan are discussed in the paper to illustrate the technical peculiarities of MI-Search. 2 Overview of MI-Search The MI-Search approach is shown in Figure 1 and it is characterized by the following distinguishing features. Fig. 1. The MI-Search approach Capability to consider different kinds of web contents in a seamless way. MI-Search is conceived to deal with contents extracted from different kinds of web resources. In particular, in MI-Search, we distinguish three different kinds of web resources, that are tagged resources, microdata resources, and semantic web resources (see the bottom part of Figure 1). Tagged resources are traditional web resources (i.e., web pages) and they are characterized by a raw structure with few metadata. Microdata resources are posts/comments coming from news feeds and microblogging systems (e.g., Facebook, Twitter posts). A microdata resource is characterized by a short textual content and a set of metadata/properties, like title, author, and creation date, that are commonly employed to describe publishing items. Semantic web resources are instances/individuals coming from RDF(S) knowledge repositories and OWL ontologies and they are characterized by a structured description composed of a set of assertions denoting its specification in the web document of origin. MI-Search successfully supports content interoperability by providing a support repository where all the considered contents are stored according to a reference data model developed in the project. 19 Capability to perform a similarity-based aggregation of web contents. The web contents stored in the support repository are submitted to a matching and classification process where contents referring to the same argument are first detected and then aggregated in similarity clusters (see the middle part of Figure 1). A similarity cluster is defined to collect web resources that can have a different nature, but are similar in content. In other words, a similarity cluster represents a specific argument and it contains all the web resources, either tagged, microdata, and semantic web resource, that refer to that argument. Capability to tailor the contents to deliver according to the user profile and interests. The similarity clusters are exploited by the final users during their content surfing activities. By relying on the user profile/interests, the similarity clusters that are interesting for the user are selected. This way, MI-Search succeeds in tailoring the most appropriate information and/or the suggestions to deliver to the user according to the specific scenario that is currently enforced (see the top part of Figure 1). The MI-Search project distinguishes two different kinds of user categories: the personal users and the business users. Personal users are users interested in receiving information and they exploit the MI-Search technology for obtaining contents and suggestions about their personal interests and events in the agenda. Business users are users interested in public events and other possible situations that are suitable for promoting their business activities. In this respect, the following three main scenarios have been envisaged in MI-Search: ─ Search-2-me scenario. This is the typical scenario of personal users and it is triggered when new personal events are planned by the user in the agenda. By exploiting the user agenda, the MI-Search technology discovers the user interests and it can provide a complete set of information about a planned event. In particular, MI-Search retrieves spontaneous information and user-generated contents related to the considered event, like comments from other similar users and special user offers joint with the participation to the event. As an example, we consider an art-exhibition event about the singer Fabrizio de André located in Milan at Rotonda della Besana. The user plans to visit this exhibition and a personal event is inserted in the agenda for a certain date. Through specialized websites (e.g., http://www.fabriziodeandrelamostra.com), MI-Search automati- cally provides to the user all the available information about the exhibition and about the singer. Moreover, other information are extracted by the MI-Search technology from social networks (e.g., Facebook2, Twitter3) to provide comments of other users that previously visited the exhibition. ─ Me-2-search scenario. This is the typical scenario of business users and it is triggered by the user when she/he start browsing the available suggestions that the system provides as potentially interesting opportunities for promoting the 2 http://www.facebook.com/. 3 http://www.twitter.com/. 20 user business. The user can browse a suggestion list of public events that can be interesting from the business point of view and she/he can decide to insert in the system a new business offer joint with a suggested event. For example, we consider a business user that has a sushi restaurant located in Milan, Viale Montenero (near to Rotonda della Besana). When the user starts browsing the possible suggestions, the art exhibition about Fabrizio de André at Rotonda della Besana is retrieved (due to a geo-locality proximity). The business user can decide to insert in the system a special menu price for the exhibition visitors. Such an offer will be linked to the art exhibition event and it will be visualized by personal users that plan to visit the exhibition. ─ Recommend-2-me scenario. This is a basic scenario of personal users and it is permanently active without requiring any triggering event. The recommend-2- me scenario is based on the user interests expressed in the personal profile to suggest events and/or (promotional) initiatives that can be potentially interesting. In this scenario, the user periodically receives a report with a list of upcoming events, either public and business events, that match her/his preferences for possible selection (and subsequent insertion in the personal agenda). As an example, we consider a personal user who specified an interest for sushi restaurants in her/his profile. Receiving the periodic report of interesting upcoming events, the user becomes aware of the special menu price of the sushi restaurant in Viale Montenero joint with the art exhibition about Fabrizio de André. The user can decide to visit the exhibition with the goal to subsequently take advantage of the special sushi offer. A personal event is inserted in the user agenda to plan the visit and to receive further information about the event (see the search-2-me scenario). 3 The MI-Search techniques In the following, we discuss some technical details of MI-Search with special reference to those aspects of the project that are concerned with interoperability issues. In particular, web content acquisition and web resource matching and classification of MI-Search will be presented. 3.1 Web content acquisition MI-Search is based on a support repository called MI-Search-DB capable of storing all the different kinds of web contents considered in the project through a uniform representation. The representation of specific features for event localization, such as spatial/temporal coordinates, is also enforced in MI-Search-DB. The repository is implemented as a PostgreSQL relational database, whose ER schema is shown in Figure 2. In the schema, we note that any kind of considered web content is represented through the entity Web Content. Web contents are distinguished in events (entity Event) and resources (entity Resource). 21 Fig. 2. The schema of the MI-Search-DB repository for web content acquisition Event. Events are classified in public events (entity Public Event), that represent official initiatives like art exhibitions or concerts, and business events (entity Business Event) that represent commercial initiatives inserted by business users. An event is characterized by attributes that describe its temporal frame (i.e., from-date, to- date, time, and frequency) and other features, like description and price (where needed). The entity Event is associated with the entities Contact and Location to represent the different contact-points for the event (e.g., Phone, Facebook page, Twitter channel) and the geo-coordinates where the event takes place, respectively. Resource. Resources are web contents acquired from outside the system and they distinguished in Tagged Resource, Microdata Resource, and Semantic Web Resource as discussed in Section 2. Tag. Each web content, either event or resource, is associated with a set of tags (entity Tag) denoting the keywords that mostly characterize the event/resource. For an event, the set of tags can be automatically extracted from one or more reference website. This usually happens with public events. Otherwise, tags can be manually inserted by the user that inserts the event. This usually happens with business events. For a resource, the set of tags is automatically extracted from the resource content itself. In a tagged resource, tags are extracted from bookmarking and social annotation systems (e.g., Delicious, Flickr). In a microdata resources, tags are extracted from the textual resource content and from other available metadata/properties, like the title. In a semantic web resource, tags are extracted from literals, property names, and property values contained in the RDF/OWL assertions of the resource specification. We note that, before insertion in the entity Tag, a tag is submitted to a normalization procedure for word-lemma extraction and for compound-term tokenization [4,7]. Example. In Figure 3, we consider two examples of acquired web contents. 22 Fig. 3. Examples of web resource acquisition Figure 3(a) shows a RSS post published on a well-known electronic wall about events planned in the city of Milan (http://blog.milano-italia.it/). This is an example of public event related to the art-exhibition about Fabrizio de Andrè located at at Rotonda della Besana. Besides the featuring attributes expected in MI-Search-DB for a public event, contact and location information are also provided. Figure 3(b) shows a comment posted on the Facebook social network published by a user that visited the art-exhibition about Fabrizio de André. This is an example of microdata resource featured by its URL on the web as expected in MI-Search-DB. Moreover, either the public event and the microdata resource, are associated with a set of tags automatically extracted from the two posts as a sort of synthetic characterization of each web contents. 3.2 Web content matching and classification The goal of matching and classification in MI-Search is to detect and build the similarity clusters to use for content delivery to the final users. Content matching. This step has the goal to evaluate the degree of similarity between each pair of web contents stored in the MI-Search-DB. Given two web contents wci and wcj, the similarity coefficient σ(wci,wcj) ∈ [0,1] denotes the level of similarity of wci and wcj based on their commons tags. We define Tagwc = {tag1, …, tagm} as the set of tags associated with the web content wc in MI-Search-DB. The similarity coefficient σ(wci,wcj) is calculated as follows: 2 ∗ |  ~  |  ,   | |  | | 23 where tagx ~ tagy denotes that tagx ∈ Tagwci and tagy ∈ Tagwcj are matching tags according to a string matching metric that considers the structure of tagx and tagy. For σ calculation, we employ our matching system HMatch 2.0, where state-of-the-art metrics for string matching (e.g., I-Sub, Q-Gram, Edit-Distance, and Jaro-Winkler) are implemented [2]. Content classification. Similarity clusters are built by relying on a clique percolation method (CPM) [6]. This method receives in input a graph G where nodes are the web contents stored in the MI-Search-DB repository and edges are established between any pair (wci, wcj) of similar contents for which σ(wci,wcj) ≥ th (th ∈ (0,1] is a matching threshold denoting the minimum level of similarity required to consider two web contents as matching contents). The CPM returns a set of similarity clusters where each cluster collects a region of nodes in G that are more densely connected to each other than to the nodes outside the region. The CPM is based on the notion of k-clique which corresponds to a complete (fully-connected) sub-graph of k nodes within the graph Gs+. Two k-cliques are defined as adjacent k-cliques if they share k - 1 nodes. The CPM determines clusters from k-cliques. In particular, a cluster, or more precisely, a k-clique-cluster, is defined as the union of all k-cliques that can be reached from each other through a series of adjacent k-cliques. More technical details about the CPM and the construction of similarity clusters can be found in [3]. Example. We consider the example shown in Figure 3. We call wc1 the public event of Figure 3(a) and wc2 the microdata resource of Figure 3(b). The similarity coefficient of wc1 and wc2 is σ(wc1,wc2) = 0.35 due to the matching tags in Tagwc1 and Tagwc2. With a matching threshold th = 0.3, the web contents wc1 and wc2 are considered as matching contents and an edge (wc1, wc2) is set in the graph G that is passed to the CPM method for calculation of the similarity clusters. An example of similarity cluster is shown in Figure 4. Besides the web contents wc1 and wc2, the cluster of Figure 4 contains a Flickr image taken from the exhibition (tagged resource), another Facebook user comment (microdata resource), the Freebase page about Fabrizio de André (semantic web resource), and the contact information of the Aoyama restaurant, a sushi restaurant that published in MI-Search a discounted dinner offer associated with the art-exhibition at Rotonda della Besana (i.e., business event). Such a cluster will be exploited by the delivery services of MI-Search when a request about the Fabrizio de André art-exhibition is submitted by a user. 4 Concluding remarks In this paper, we presented the main features of the MI-Search project for urban- centered and calendar-oriented surfing of web contents. Technical issues about web content acquisition, matching, and classification as well as real examples applied to the city of Milan are also discussed. 24 Fig. 4. An example of similarity cluster Ongoing research work is devoted to study the problem of periodically refreshing the contents of the MI-Search-DB repository and to complete the acquisition of a dataset about the city of Milan to be employed for experimentation. Moreover, matching techniques combining both string-based techniques and position-based techniques are currently under development as well as techniques for content delivery based on similarity cluster exploitation. Finally, next-future activities will be focused on the development of a mobile prototype based on the presented ideas. References 1. Bizer, C., Heath, T., Berners-Lee, T.: Linked Data - The Story So Far. Int. Journal on Semantic Web and Information Systems 5(3) (2009) 2. Castano, S., Ferrara, A., Montanelli, S.: Matching Ontologies in Open Networked Systems: Techniques and Applications. Journal on Data Semantics V (2006) 3. Castano, S., Ferrara, A., Montanelli, S.: Thematic Exploration of Linked Data. In: Proc. of the 1st VLDB Int. Workshop on Searching and Integrating New Web Data Sources (VLDS 2011). Seattle, USA (2011) 4. Castano, S., Varese, G.: Next Generation Data Technologies for Collective Computational Intelligence, chap. Building Collective Intelligence through Folksonomy Coordination. Springer (2011) 5. Easley, D., Kleinberg, J.: Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press (2010) 6. Palla, G., Derènyi, I., Farkas, I., Vicsek, T.: Uncovering the Overlapping Community Structure of Complex Networks in Nature and Society. Nature 435 (2005) 7. Sorrentino, S., et al.: Schema Normalization for Improving Schema Matching. In: Proc. of the 28th Int. ER Conference. Gramado, Brazil (2009)