=Paper=
{{Paper
|id=Vol-1316/paper4
|storemode=property
|title=A Thin-Server Approach to Ephemeral Web Personalization Exploiting RDF Data Embedded in Web Pages
|pdfUrl=https://ceur-ws.org/Vol-1316/privon2014_paper4.pdf
|volume=Vol-1316
|dblpUrl=https://dblp.org/rec/conf/semweb/NartTD14a
}}
==A Thin-Server Approach to Ephemeral Web Personalization Exploiting RDF Data Embedded in Web Pages==
Dario De Nart, Carlo Tasso, Dante Degl'Innocenti
Artificial Intelligence Lab, Department of Mathematics and Computer Science, University of Udine, Italy
{dario.denart,carlo.tasso}@uniud.it, dante.deglinnocenti@spes.uniud.it

'''Abstract.''' Over the last years adaptive Web personalization has become a widespread service, and all the major players of the WWW provide it in various forms. Ephemeral personalization, in particular, deals with short-term interests, which are often tacitly inferred from user browsing behaviour or contextual information. Such personalization can be found almost anywhere on the Web in several forms, ranging from targeted advertising to automatic language localisation of content. In order to present personalized content, a user model is typically built and maintained at server side by collecting, explicitly or implicitly, user data. In the case of ephemeral personalization this means storing at server side a huge amount of user behaviour data, which raises severe privacy concerns. The evolution of the Semantic Web and the growing availability of semantic metadata embedded in Web pages allow a role reversal in the traditional personalization scenario. In this paper we present a novel approach to ephemeral Web personalization consisting of a client-side semantic user model built by aggregating RDF data encountered by the user in his/her browsing activity and enriching it with triples extracted from DBpedia. The user model is then queried by a server application via SPARQL to identify a user stereotype and finally address personalized content.

'''Keywords:''' User Modeling, Open Graph Protocol, Ephemeral Personalization, RDFa, DBpedia, Privacy

===1 Introduction===

Personalization is one of the leading trends in Web technology today and is rapidly becoming ubiquitous on the Web. Most of the time the process is evident, for instance when Web sites require their users to sign in and ask for their preferences in order to maintain accessible and persistent user profiles. In other cases, however, personalization is more subtle and hidden from the user.

Ephemeral personalization [13], for instance, aims at providing personalized content fitting only short-term interests that can expire after the current navigation session. Most of the time this kind of personalization does not require the user to sign in, since all the information needed to determine which content should be presented may be found in his/her browsing cache, and/or content providers are not interested in modelling and archiving such short-term interests. An example of ephemeral personalization is targeted advertising, that is, providing personalized ads to users as they browse the Web. This task is currently accomplished by checking which cookies are present in the client's browser cache and selecting candidate ads accordingly. In most cases, however, this process results in a particular ad from a previously visited site "stalking" the user throughout all his/her browsing activities. As the authors of [10] suggest, this may generate revenue for the advertiser by encouraging customers to return, but it can also be extremely annoying, and users may perceive their privacy as being attacked.
Other forms of ephemeral personalization are guided by contextual information derived from the IP address of the client or by analysing the content of the pages that the client requests, as in Amazon's product pages: these are very shallow forms of personalization and do not involve a persistent, long-term user model.

In this work we propose another way to address ephemeral Web personalization. Our approach consists in collecting the semantic metadata contained in visited Web pages in order to build a client-side user model. This model is then queried by content providers to identify a user stereotype and consequently recommend content. In this way the user has total control over his/her user model and the content provider does not need to save and maintain user profiles; privacy risks are therefore significantly reduced.

Before proceeding to the technical matter, we would like to point out that our approach heavily relies on the availability of semantic metadata embedded in Web pages: the more metadata available, the more detailed the user profile; vice versa, if visited pages contain no metadata, no user profile can be built. Fortunately, a huge number of Web sites actually provide semantic annotations, consisting of Microformats, Microdata, or RDFa data, mostly conforming to vocabularies such as Facebook's Open Graph protocol, hCard, and Schema.org.

The rest of the paper is organized as follows: in Section 2 we briefly introduce some related work; in Section 3 we present the proposed system; in Section 4 we illustrate our data model; in Section 5 we discuss some experimental results; and, finally, in Section 6 we conclude the paper.

===2 Related Work===

Several authors have already addressed the problem of generating, parsing, and interpreting structured metadata embedded in Web sites. Several metadata formats aimed at enriching Web pages with semantic data have been proposed and adopted by a wide range of Web authors. Microformats (http://microformats.org/) such as hCard, proposed in 2004, extend HTML markup with structured data aimed at describing the content of the document they are included in. Although similar in several respects, Microformats are not compliant with the RDF data model, which can be fully exploited by using RDFa data. Facebook's Open Graph protocol (http://ogp.me/) is an RDFa vocabulary proposed in 2010 that has become extremely widespread on the Web, since it makes it possible to establish connections between Web contents and any Facebook user's social graph; however, it has also raised many concerns about privacy and service-dependency issues [19]. Another relevant format is Schema.org (http://schema.org/), proposed in 2011 and supported by several major search engines such as Google, Yahoo, Yandex, and Bing. Schema.org data can be expressed as RDFa or Microdata, an HTML specification aimed at expressing structured metadata in a simpler way than Microformats and RDFa. The authors of [2] provide an extensive survey of the presence of such metadata on the Web, based on the analysis of over 3 billion Web pages, showing that, although Microformats are still the most used format, RDFa is gaining popularity and severely outnumbers Microdata. Moreover, they measured that 6 out of the 10 most used RDFa classes on the Web belong to the Open Graph protocol.

Automatic metadata generation has been widely explored and can be achieved in many ways: extracting entities from text [6], inferring hierarchies from folksonomies [18], or exploiting external structured data [11].
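As a concrete illustration of the annotations discussed above, the following sketch shows how Open Graph RDFa typically appears in a page header; the page and all property values are hypothetical, not taken from any data set used in the paper.

<pre>
<!-- Hypothetical page header annotated with the Open Graph protocol.
     og:title, og:type, og:url, and og:image are the four mandatory
     OGP properties discussed in Section 4. -->
<html prefix="og: http://ogp.me/ns#">
  <head>
    <title>Example Movie Review</title>
    <meta property="og:title" content="Example Movie" />
    <meta property="og:type"  content="video.movie" />
    <meta property="og:url"   content="http://example.org/movies/example-movie/" />
    <meta property="og:image" content="http://example.org/movies/example-movie/poster.jpg" />
  </head>
  <body>...</body>
</html>
</pre>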
Interoperability issues among the various metadata formats have been discussed as well: for instance, the authors of [1] propose a metadata conversion tool from Microformats to RDF.

Other authors have discussed how Semantic Web tools, such as ontologies and RDF, can be used to model users' behaviours and preferences in Recommender Systems [7]. However, the field on which most research efforts are focused is Personalized Information Retrieval. For instance, [17] presents an approach to building Ontological Profiles by exploiting a domain ontology: as the user interacts with the search engine, interest scores are assigned to concepts included in the ontology with a spreading activation algorithm. The authors of [5] discuss a system that builds a user model by aggregating user queries raised within a session and matching them with a domain ontology. Finally, the authors of [4] and [14] suggest that ontological user models can be built as "personal ontology views", that is, projections of a reference domain ontology derived by observing user interest propagation along an ontology network. However, in all these works user profiles are specializations or projections of a domain ontology, and therefore their effectiveness relies on the availability, scope, and quality of such a pre-existing asset.

The problem of preserving users' privacy while providing personalized content, presented in [16] and recently extensively surveyed in [9], has been widely discussed in the literature, and many authors have tried to address it. The authors of [12] show how to adapt the leading algorithms of the Netflix Prize in order to achieve differential privacy, that is, the impossibility of deducing user data by observing recommendations. The authors of [3] propose a different approach in which part of the user data remains at client side in order to preserve user privacy. Personal Data Store applications, such as OpenPDS (http://openpds.media.mit.edu), instead provide a trusted, user-controlled repository for user data and an application layer which can be used by service providers to address personalized content without violating users' privacy. Finally, a recent patent application [20] also claims that targeted advertising can greatly benefit from the use of semantic user models extracted from Web usage data. The authors, however, do not provide any hint about their extraction technique, focusing instead on the architecture and deployment issues of their system.

===3 System Architecture===

In order to support our claims, we developed an experimental system consisting of a client and a server module built using well-known open source tools such as Apache Jena and Semargl. Figure 1 shows the workflow of the system. The basic idea behind our work is that user interests can be identified by observing browsing activity and by analysing the content of visited Web sites; our goal is thus to exploit the user himself as an intelligent Web crawler providing meaningful data for building his/her personal profile, and the project was therefore named Users As Crawlers (herein UAC). A compact OWL2 ontology, herein referred to as the UAC ontology, was developed as well in order to introduce new modelling primitives and to support the classification of Web pages.
Among others, the primitives defined in the UAC ontology are: relatedTo, which associates Web pages with DBpedia entities named in their metadata; nextInStream, which associates a page with the next one visited by the user; and previousInStream, which is the inverse of nextInStream.

The client handles user modelling: it includes three modules (a Metadata Parser, a Data Linker, and a Reasoner) and a compact triplestore. The Metadata Parser reads the header sections of the visited Web pages and extracts RDF triples from the available RDFa metadata.

Figure 1. Workflow of the system.

Due to its wide availability, the preferred metadata format is Open Graph RDFa; however, other formats are supported as well, as long as they can be converted into RDF. The Data Linker receives the collected triples as input and adds new triples by linking visited pages to DBpedia entities. This task is accomplished both by expanding URIs pointed to by object properties and by analysing the content of datatype properties such as tag, title, and description in order to find possible matches with DBpedia entries. A list of stopwords and a POS tagger are used by the Data Linker to identify meaningful sequences of words to be matched against DBpedia. Finally, the augmented set of triples is processed by a Reasoner module, which performs logic entailments in order to classify visited pages according to the Open Graph protocol, the DBpedia ontology, and the UAC ontology. In our prototype the reasoning task is performed by the OWL Lite reasoner that comes bundled with Apache Jena, but any other OWL Lite or DL reasoner (e.g. Pellet) would fit as well. The result of this process is an RDF user model, built incrementally as the user visits Web pages, in which visited pages are classified by rdf:type properties and carry a hopefully high number of semantic properties linking them to each other and to DBpedia.

In our prototype the client is a standalone application; in a production scenario, however, it could be a Web browser plug-in that incrementally builds the user profile as pages are downloaded by the browser. Since the client contains the user model, it also allows control over it: the user can decide which pages are included in the model and whether or not to keep the crawled data between different browsing sessions. Moreover, the user model can be exported at any time to an RDF/XML file for inspection; although the client currently has no user model visualization module, several visualization and editing tools are available, such as Protégé (http://protege.stanford.edu/).

The server part of the system is designed to simulate a content provider scenario and consists of two modules (a Semantic Recommender and a User Inquirer) and a content repository. The goal of the server is to identify a user stereotype in order to suggest relevant content to the user, but instead of maintaining user profiles in a repository, it simply "asks" connected clients whether their user models have certain characteristics, much like in the Guess Who game. Since all the user modelling duties are left to the client, the server module is particularly lightweight; we therefore call it a thin server to emphasize the role reversal with respect to the traditional Web personalization approach. We assume each content item to be addressed towards a specific user stereotype, which is a realistic assumption since many e-commerce companies already perform market segmentation analysis.
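To make this "Guess Who" interaction concrete, a server question of this kind might be phrased in SPARQL as in the following sketch. The query is hypothetical (the paper does not publish its actual decision-tree queries), and, as explained below, the client answers with a match count only.

<pre>
# Hypothetical server question: "how many of the pages you visited
# are films, according to the types entailed by the Reasoner?"
# The prefixes are assumptions; the paper does not list its namespaces.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT (COUNT(DISTINCT ?page) AS ?matches)
WHERE {
  ?page rdf:type dbo:Film .
}
</pre>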
We exploit such market segmentation knowledge in order to map user characteristics onto a specific stereotype and hence onto the contents to be recommended. More specifically, in our current experimental system we use a decision tree for classifying the user, as shown in Figure 2. Each node is associated with a specific SPARQL query and each arc corresponds to a possible answer to the parent node's query; stereotypes are identified on the leaves of the tree.

Figure 2. A decision tree with SPARQL queries on the nodes and user stereotypes on the leaves.

When a client connects, it receives the SPARQL query associated with the root node in order to check whether a specific characteristic is present in the user model. The Semantic Recommender module handles the client's answer to the query: when it receives a positive answer it fetches content, while if the answer is negative, further queries are proposed until a user stereotype can be identified. Due to the hierarchical nature of the decision tree, we expect the number of queries asked of the client to be very small: indeed, in our experimental setting, in the worst case six queries were needed. In order to preserve users' privacy, the server does not have full access to the user's RDF data: although any SPARQL query is allowed, the client returns only the number of Web pages matching the query, so their URIs remain unknown to the server.

===4 Data Linking and Classification===

In order to better present our user modelling technique, in this section we show a step-by-step example of what the proposed system does when a Web page is visited. In this example we consider a randomly chosen Rottentomatoes.com page; Figures 3, 5, and 6 show the extracted and inferred data in RDF/XML syntax. As the page is loaded, the Metadata Parser retrieves the available RDFa metadata. Metadata commonly embedded in Web pages actually provide a very shallow description of the page's content: the Open Graph protocol itself specifies only four properties as mandatory (title, image, type, and url) and six object classes (video, music, article, book, profile, and website). However, this information constitutes a good starting point, especially when a few additional, optional properties (e.g. description, video:actor, and article:tag) are also specified, which can provide URIs of other entities or possible "hooks" to more descriptive ontologies. Figure 3 shows the metadata available in the example Web page, including both mandatory and optional Open Graph properties.

Figure 3. Metadata extracted from the example Web page.

The next modelling step, performed by the Data Linker module, aims at enriching these data by linking the visited page to possibly related ontology entities. This step needs a reference ontology, and we adopted a general-purpose, freely available one, namely DBpedia. This choice is motivated by three factors: (i) in a realistic scenario it is impossible to restrict users' Web usage to a particular domain; (ii) authors may describe their contents in ways not compliant with a single taxonomy crafted by a domain expert, so the ontology needs to be the result of a collaborative effort; and (iii) since the modelling task is to be accomplished at client side, we need a freely accessible ontology.
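Since Figure 3 itself is not reproduced here, the following Turtle sketch suggests the kind of triples the Metadata Parser would obtain from such a page; the property values below are illustrative assumptions rather than the figure's actual content.

<pre>
@prefix og:    <http://ogp.me/ns#> .
@prefix video: <http://ogp.me/ns/video#> .

# Illustrative reconstruction of Open Graph metadata for the example
# page; the actual extracted values appear in Figure 3 of the paper.
<http://www.rottentomatoes.com/m/mad_max/>
    og:title    "Mad Max" ;
    og:type     "video.movie" ;
    og:url      "http://www.rottentomatoes.com/m/mad_max/" ;
    og:image    <http://example.org/mad-max-poster.jpg> ;   # hypothetical URI
    video:actor <http://example.org/people/mel-gibson> .    # hypothetical URI
</pre>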
Figure 4. An example of DBpedia data linking.

The Data Linker analyses the RDF data extracted from the pages in order to discover "hooks" to DBpedia, that is, named entities present in DBpedia that are mentioned in the body of properties such as title or description. Such properties are analysed by means of stopword removal and POS tagging to find candidate entities; candidate entities are then matched against DBpedia entries to get the actual ontology entities. As shown in Figure 4, in our example the association between the value of the title property and the Mad Max DBpedia entity is trivial, since the value of the title property and that of the rdfs:label property of the DBpedia entity are the same string. There may, however, be more complex cases: for instance, the title "Annual Open Source Development Survey" can provide a hook for the Open Source entity. Once these entities have been identified, they are linked to the RDF representation of the Web page with a relatedTo property, defined in the UAC ontology. An additional UAC property, prevInStream, containing a link to the page that precedes the considered one in the navigation stream, is added to the page data, as shown in Figure 5. All the rdf:type, dc:subject, and db:type attributes of the linked DBpedia entity are then imported into the RDF user model, in order to provide further information about the contents of the page and to support the classification task.

Figure 5. The example page linked to related DBpedia entities and to the previously visited page.

The final step of the modelling activity is the classification performed asynchronously by the Reasoner, which integrates the data gathered in the previous steps with the class and property axioms provided by the Open Graph protocol and by the UAC ontology. Due to its asynchronous nature, the Reasoner can also link a page to the one visited after it, using the nextInStream property. The result, as shown in Figure 6, is, aside from the nextInStream property, a series of rdf:type properties which provide a classification of the visited page. For instance, in our example the entailed properties state that the URI http://www.rottentomatoes.com/m/mad_max/ corresponds to a Website, a Work (as defined in DBpedia), and a Film. These type properties are added to the RDF representation of the Web page, along with the crawled data and the DBpedia data generated in the previous steps, and stored in the user model.

Figure 6. The visited page data annotated with the properties inferred by the Reasoner.

The user model is built incrementally as Web pages are visited and can be queried at any time using the SPARQL query language. This choice allows the server to formulate an extremely wide range of queries, from very generic to highly specific, and to achieve an arbitrary level of detail in user stereotyping. For instance, a content provider may be interested in recommending movies rather than books: by asking for the Web pages with an rdf:type property set to "Film", it will detect, in our example, that the user visited a page about a movie. In another scenario, the content provider may be interested in determining exactly which kind of movie to recommend: by asking for all the Web pages related to a DBpedia entity with a dcterms:subject property set to "Road Movie", it will detect that the user visited a page about a road movie.
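In SPARQL terms, this second scenario might be sketched as follows; the uac: namespace and the exact category URI are assumptions, since the paper does not publish the UAC ontology's base URI.

<pre>
# Hypothetical query for the "road movie" scenario: count visited pages
# related (via the UAC relatedTo property) to a DBpedia entity whose
# dcterms:subject is the road-movie category.
PREFIX uac:     <http://example.org/uac#>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT (COUNT(DISTINCT ?page) AS ?matches)
WHERE {
  ?page   uac:relatedTo ?entity .
  ?entity dcterms:subject <http://dbpedia.org/resource/Category:Road_movies> .
}
</pre>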
Countless other examples are possible: for instance, a content provider might be interested in knowing whether the user has sequentially visited a given number of sites dealing with the same topic, or whether he/she has ever visited a page dealing with multiple given topics.

===5 Evaluation===

Formative tests were performed in order to evaluate the accuracy of the proposed method. In our experiment we asked a number of volunteers (mostly university students) to let us use their browsing histories, in order to have real-world data. To avoid biases, browsing data was extracted only from sessions that occurred in the five days before the test was performed. All test subjects were completely unaware of the real purpose of the experiment. After supplying the data, the volunteers were asked to review their own browsing history in order to identify the different sessions and to point out what they were actually looking for. At the end of the interviews we were able to identify six user stereotypes, much as market analysts do when performing segmentation analysis. Since we had no real content to provide in this experiment, we only classified users. The six identified stereotypes are: (i) people mostly interested in economics (nicknamed business), (ii) people mostly interested in courses, seminars, summer schools, and other educational events (student), (iii) people mostly interested in films and TV series (moviegoer), (iv) people mostly interested in music (musician), (v) people mostly interested in videogames (gamer), and, finally, (vi) people whose main interests are hardware, programming, and technology in general (techie).

Four iterations of the data gathering and testing process were performed, each time with different volunteers, in order to incrementally build a data set and to evaluate our approach with different users, different browsing habits, and different sizes of the training set. In the first iteration 36 browsing sessions were collected and labelled, in the second 49, in the third 69, and in the fourth 82. All the RDFa data included in the Web pages visited by our test users was considered and used by our prototype to build an RDF user model for each browsing session. Across the collected sessions, the average number of Web sites visited in a single browsing session was 31.5 and the average number of RDF triples extracted from a browsing session after the Data Linker performed its task was 472.8, that is, an average of 15 triples per page, which provides a significant amount of data to work on.

During each iteration of the evaluation, the rdf:type properties of the visited Web pages were used as features to train a decision tree. In this experiment the J48 algorithm [15] was used; Figure 7 shows an example of a generated tree, built during the third iteration. The nodes of the tree were then replaced with SPARQL queries, and this structure was used to classify a number of user models. Due to the shortage of test data, a ten-fold cross validation approach was used to estimate the accuracy of the system. Table 1 shows the results of the classification over the four iterations of the data set. Our system was compared against the ZeroR predictor, which always returns the most frequent class in the training set, in order to have a baseline. For this formative experiment, only the precision metric (defined as the number of correctly classified sessions over the total number of sessions) was considered.

Figure 7. A decision tree built during the third iteration of the experiment.
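As a rough illustration of this training step, the following sketch reproduces J48 training and ten-fold cross validation with Weka, the library that ships J48 as its C4.5 implementation; whether the authors used Weka directly is not stated, so this is an assumption, and the feature file name is hypothetical.

<pre>
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import java.util.Random;

// Hypothetical reproduction of the evaluation step: train a J48 tree on
// per-session feature vectors (rdf:type features) labelled with the six
// stereotypes, and estimate precision via ten-fold cross validation.
public class UacEvaluationSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("sessions.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);      // stereotype label

        J48 tree = new J48();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());

        // Final tree, whose nodes would then be replaced by SPARQL queries.
        tree.buildClassifier(data);
    }
}
</pre>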
Though the precision values are not very high, it is important to point out two limitations of the performed tests. First, the number of considered browsing sessions is extremely low, due to the fact that only a handful of volunteers let us freely analyse and use their browsing history data; in fact, many volunteers dropped out as soon as they realized that their actual browsing history, and not some laboratory activity, was needed. Secondly, these results were obtained by considering only the rdf:type attribute as a feature when building the decision tree.

Table 1. Average precision of the UAC system and of a ZeroR classifier on the considered data sets.

{| class="wikitable"
! Data set size !! ZeroR precision !! Tree precision
|-
| 36 || 0.306 || 0.639
|-
| 49 || 0.195 || 0.601
|-
| 69 || 0.217 || 0.623
|-
| 82 || 0.203 || 0.631
|}

Evaluation and development are ongoing, and further experiments, with more test users, more stereotypes, and a richer RDF vocabulary, are planned.

===6 Conclusion and Future Work===

In this paper we presented a new approach to ephemeral personalization on the Web, relying on the semantic metadata available on the Web; even though the presented results are still preliminary, the overall outcome is very promising. With the growth of the Web of Data, we expect in the next few years to be able to raise the average number of triples extracted from a browsing session and therefore to build more detailed user profiles.

In our opinion this approach could fit particularly well the application domain of targeted advertising, because of three major advantages over the current cookie-based techniques: (i) our approach can recommend novel contents related to the current browsing context, rather than associate a user with a set of already visited (and potentially disliked) contents; (ii) the explicit decision model of the decision tree can easily be reviewed by domain experts, supporting market analysis and knowledge engineering; and (iii) by deploying the user model at client side, the user has total control over his/her own data, which addresses many privacy concerns. However, the proposed approach has one major drawback: in order to receive personalized contents, users have to install a client, which may be either a browser plug-in or a standalone application. This nevertheless seems necessary for providing real privacy, and other works aimed at addressing the privacy issues of online advertising have also stated the need for a software agent [8].

Our future plans include, among other extensions, the integration of a keyphrase extraction module aimed at automatically extracting significant phrases from the textual data included in Web pages, thereby enriching the content metadata available to the Reasoner and Recommender modules [6]. Future work will also address scalability issues, possibly replacing some of the currently employed libraries with ad hoc developed modules.

===References===

1. Adida, B.: hGRDDL: Bridging Microformats and RDFa. Web Semantics: Science, Services and Agents on the World Wide Web 6(1), 54–60 (2008)
2. Bizer, C., Eckert, K., Meusel, R., Mühleisen, H., Schuhmacher, M., Völker, J.: Deployment of RDFa, Microdata, and Microformats on the Web – a quantitative analysis. In: The Semantic Web – ISWC 2013, pp. 17–32. Springer (2013)
3. Castagnos, S., Boyer, A.: From implicit to explicit data: a way to enhance privacy. Privacy-Enhanced Personalization p. 14 (2006)
4. Cena, F., Likavec, S., Osborne, F.: Propagating user interests in ontology-based user model. In: AI*IA 2011: Artificial Intelligence Around Man and Beyond, pp. 299–311 (2011)
5. Daoud, M., Tamine-Lechani, L., Boughanem, M., Chebaro, B.: A session based personalized search using an ontological user profile. In: Proceedings of the 2009 ACM Symposium on Applied Computing, pp. 1732–1736. ACM (2009)
6. De Nart, D., Tasso, C.: A domain independent double layered approach to keyphrase generation. In: WEBIST 2014 – Proceedings of the 10th International Conference on Web Information Systems and Technologies, pp. 305–312. SCITEPRESS Science and Technology Publications (2014)
7. Gao, Q., Yan, J., Liu, M.: A semantic approach to recommendation system based on user ontology and spreading activation model. In: Network and Parallel Computing, 2008 (NPC 2008), IFIP International Conference on, pp. 488–492 (Oct 2008)
8. Guha, S., Cheng, B., Francis, P.: Privad: practical privacy in online advertising. In: Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, pp. 13–13. USENIX Association (2011)
9. Jeckmans, A.J., Beye, M., Erkin, Z., Hartel, P., Lagendijk, R.L., Tang, Q.: Privacy in recommender systems. In: Social Media Retrieval, pp. 263–281. Springer (2013)
10. Lambrecht, A., Tucker, C.: When does retargeting work? Information specificity in online advertising. Journal of Marketing Research 50(5), 561–576 (2013)
11. Liu, X.: Generating metadata for cyberlearning resources through information retrieval and meta-search. Journal of the American Society for Information Science and Technology 64(4), 771–786 (2013)
12. McSherry, F., Mironov, I.: Differentially private recommender systems: building privacy into the Netflix Prize contenders. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 627–636. ACM (2009)
13. Mizzaro, S., Tasso, C.: Ephemeral and persistent personalization in adaptive information access to scholarly publications on the Web. In: Proceedings of the Second International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems, AH '02, pp. 306–316. Springer-Verlag, London, UK (2002), http://dl.acm.org/citation.cfm?id=647458.728228
14. Osborne, F.: A POV-based user model: from learning preferences to learning personal ontologies. In: User Modeling, Adaptation, and Personalization, pp. 376–379. Springer (2013)
15. Quinlan, J.R.: C4.5: Programs for Machine Learning, vol. 1. Morgan Kaufmann (1993)
16. Ramakrishnan, N., Keller, B.J., Mirza, B.J., Grama, A.Y., Karypis, G.: Privacy risks in recommender systems. IEEE Internet Computing 5(6), 54–63 (2001)
17. Sieg, A., Mobasher, B., Burke, R.D.: Learning ontology-based user profiles: a semantic approach to personalized web search. IEEE Intelligent Informatics Bulletin 8(1), 7–18 (2007)
18. Tang, J., Leung, H.f., Luo, Q., Chen, D., Gong, J.: Towards ontology learning from folksonomies. In: IJCAI, vol. 9, pp. 2089–2094 (2009)
19. Wood, M.: How Facebook is ruining sharing. Weblog post 18 (2011)
20. Yan, J., Liu, N., Ji, L., Hanks, S.J., Xu, Q., Chen, Z.: Indexing semantic user profiles for targeted advertising. US Patent 8,533,188 (Sep 10 2013)