Semantic Technologies to Support the User-Centric Analysis of Activity Data Mathieu d’Aquin, Salman Elahi and Enrico Motta Knowledge Media Institute, The Open University, Milton Keynes, UK {m.daquin, s.elahi, e.motta}@open.ac.uk Abstract. There is currently a trend in giving access to users of on- line services to their own data. In this paper, we consider in particular the data which is generated from the interaction between a user and an organisation online: activity data as held in websites and Web applica- tions logs. We show how we use semantic technologies including RDF integration of log data, SPARQL and lightweight ontology reasoning to aggregate, integrate and analyse activity data from a user-centric point of view. 1 Introduction Social interactions on the Web, especially between individual users and organi- sations, rely on the exchange of personal data. As discussed in the article ”Show Us the Data! (It is ours after all)” in the New York Times by Richard H. Thaler1 , while being heavily exploited by online organisations, these data are rarely made accessible to the users themselves. However, many initiatives have emerged re- cently arguing that obtaining and being able to exploit such data could be very beneficial to individual users. The mydata project in the UK2 for example fo- cuses on consumer data. At Google, the Data Liberation Front 3 has been formed to push the deployment of mechanisms allowing users to extract their data from Google services. In relation to this, there is currently a wide expansion of the idea of self-tracking, with new forms of applications being created based on social and personal data on the Web (see e.g., [1, 2]). There are however specific challenges that appear when trying to apply such a user-centric perspective on activity data. Activity data typically sits in the logs of websites and Web applications, and are exploited by online organisa- tions, in an aggregated form, to provide overviews of the interactions between the organisation’s online presence and its users (most commonly in the form of website analytics). UCIAD4 is a short project with the aim to experiment with the technological challenges that are faced when trying to invert the perspective 1 http://www.nytimes.com/2011/04/24/business/24view.html 2 http://www.bis.gov.uk/news/topstories/2011/Apr/better-choices-better-deals 3 http://www.dataliberation.org/ 4 http://uciad.info on activity data: provide individual users with an overview of their interactions with the online organisation. This raises a number of challenges for which the use of semantic technologies seem to provide adequate solutions: Fragmentation and heterogeneity: Activity data is typically held in log files that have different formats, and might not be easily integratable from one system (website, application) to another. User identification: Recognising and identifying a user within the data is typ- ically a problem faced by any activity data analysis. However, when taking a user-centric perspective, a user needs to be identified over several systems and the consequences of inaccurately recognising a user can be more critical. Data analysis: Activity data is generally available through raw, uninterpreted logs from which meaningful information is hard to obtain. Scale: Tracking user activities through logs can generate immense amounts of data. Typical systems cope with such scale through aggregating data based on clusters of users. Here, we need to keep the whole set of data for each individual user available to provide meaningful analysis of their interaction with the organisation in a user-centric way. In this paper, we show how we investigated and handled these challenges through relying on semantic technologies, especially RDF for the low level inte- gration and management of data, ontologies for the aggregation of heterogeneous data and their interpretation, and lightweight ontological reasoning to support customisable analysis of user-centric activity data. We also discuss how these components have been put together in a demonstrator platform, the UCIAD platform, providing user-centric views on activity data related to several web- sites of the Open University. 2 Activity Data Integration - Base Architecture There are two reasons why we believe semantic technologies can benefit the anal- ysis of activity data in general, and from a user-centric perspective in particu- lar. First, ontology related technologies (including OWL, RDF and SPARQL) provide the necessary flexibility to enable the “lightweight” integration of data from different systems. Not only we can use ontologies as “pivot” models for data coming from different systems, but such models are also easily extensible to take account of the particularities of the systems available, but also to allow for custom extensions to deal with particular users, making personalised analysis of activity data feasible. The overall architecture of the activity data infrastructure set up for the UCIAD project is shown in Figure 1. Its goal is to support the extraction from a variety of logs, of homogeneous representations of the traces of activity data present in these logs and store them in a common semantic store so that they can be accessed and queried by the user. We use RDF as a common data model, and a triple store providing SPARQL querying facilities for storing and accessing the Fig. 1. Overview of the architecture of the UCIAD platform. data.5 Information from logs is extracted on a daily basis and represented using the ontologies described in the next section, which together with the semantic store represent the basis of the platform to provide user-centric views on activity data. 3 Aggregating Heterogeneous Activity Data - The UCIAD Ontologies Compared to other domains, the advantage of user activities is that there is a lot of data to look at. This might be seen as an issue (from a technical and conceptual point of view), but in reality, this allows us to apply a bottom- up approach to building the ontologies necessary to achieve our goal: modelling through characterising the data, rather than through conceptualising the domain from established expert knowledge. It also gives us an insight into the scale of the task, and the need for adapted tools to support both the ontological definition of specific situations, and the ontology-based analysis of large amounts of traces of activity data. 5 We use OWLIM (http://www.ontotext.com/owlim) which provides scalable storage and lightweight reasoning facilities. 3.1 Identifying Concepts and their Relations The first step in building our ontologies is to identify the key concepts, i.e., the key notions, that we need to tackle, bearing in mind that our ultimate goal is to understand activities. We rely extensively on website logs as sources of activity data. In these cases, we can investigate requests both from human users and from robots automatically retrieving and crawling information from the websites. The server logs in question represent collections that can be seen as traces of activities that these users/robots are realising on websites. We therefore need to model these other aspects, which correspond to actions that are realised by actors on particular resources. We propose three ontologies to be used as the basis of the work in UCIAD: The Actor Ontology is an ontology representing different types of actors (hu- man users vs. robots), as well as the technical settings through which they realise online activities (computer and user agent). The Sitemap Ontology is an ontology to represent the organisation of web- pages in collections and websites, and which is extensible to represent differ- ent types of webpages and websites. The Trace Ontology is an ontology to represent traces of activities, realised by particular agents on particular resources (here, mostly webpages). As we currently focus on HTTP server logs, this ontology contains specific sections related to traces as HTTP requests (e.g., HTTP methods are represented as instances of “Action” and HTTP response codes are included within “Re- sponse” type objects). It is however extensible to other types of traces, such as specific logs for particular applications. 3.2 Reusing Existing Ontology When dealing with data and ontologies, reuse is generally seen as a good practice. Apart from saving time from not having to remodel things that have already been described elsewhere, it also helps anticipating on future needs for interoperability by choosing well established ontologies that are likely to have been employed elsewhere. We therefore investigated existing ontologies that could help us define the notions mentioned above. Here are the ontology we reused: The FOAF ontology (http://xmlns.com/foaf/spec/) is commonly used to describe people, their connections with other people, but also their con- nections with documents. We use FOAF in the Actor Ontology for human users, and in the Sitemap Ontology for webpages (as documents). The Time Ontology (http://www.w3.org/TR/owl-time/) is a common ontology for representing time and temporal intervals. We use it in the Trace Ontology. The Action ontology (http://ontology.ihmc.us/Action.owl) defines dif- ferent types of actions in a broad sense, and can be used as a basis for repre- senting elements of the requests in the UCIAD Trace Ontology, but also as a base typology for actions. It itself relies on a number of other ontologies, including its own notion of actors. While not currently used in our base ontologies, other ontologies can be considered at a later stage, for example to model specific types of activities. These include the Online Presence Ontology (OPO6 ), as well as the Semantically- Interlinked Online Communities ontology (SIOC7 ). The current version of the ontologies developed as part of this work are available at https://github.com/uciad/UCIAD-Ontologies. 4 Identifying and Extracting User Activity Data Once activity data have been extracted and represented according to the ontolo- gies briefly described above, the next step consists in identifying and aggregating, in this data the traces of activities realised by a particular user, in order to cre- ate a user-centric view of his or her interactions with the considered systems (websites, applications). 4.1 Overview The information the UCIAD platform collects regarding users can be seen as similar to the one basic analytics systems have. The user is rarely seen directly, as the interaction is mediated through a “user agent”: a software programme running on a particular computer. Each HTTP request is associated with the ID of the user agent realising it, and the IP address of the corresponding com- puter. Several analytics systems use the combination of these two parameters to recognise a user with a reasonable level of accuracy. The disadvantage however is that the same user can be using different agents (e.g., different browsers) and different computers (or even mobile phones) to access the Web. In UCIAD, we have the advantage that it is very likely that the user will con- nect to the UCIAD platform using the same agents and computers they usually use to access the Web, and especially the considered websites. The “settings” the user is using can therefore be detected at the time of logging in, and be attached to the user account. These settings will then be used to aggregate all the activity data that have been realised using the same computer and user-agent, and be added to the set of activity data for the particular user. In addition, this provides a convenient mechanism to aggregate information realised on different computers and different settings. The user can log again in the UCIAD platform with a different browser, or a different device. When that happens, as described in Figure 2, the current setting will simply be added to the list of known settings for this user, and contribute another set of activity data around this particular user. A setting, in our ontology, corresponds to a computer (generally identified by its IP address) and an agent (generally a browser), identified by a complex string such as 6 http://online-presence.net/opo/spec/ 7 http://sioc-project.org/ontology Fig. 2. Associating user accounts to their settings. Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_6) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.68 Safari/534.24) Such a setting can be associated to a user based on a representation following our ontologies described above, such as in the example below: 187.108.24.45 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_6) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.68 Safari/534.24) This indicates that the user http://uciad.info/actor/mathieu has three set- tings. These settings are all on the same computer and correspond to the Safari and Chrome browsers, as well as the Apple PubSub agent (used in retrieving RSS feeds amongst other things). 4.2 Extracting User-Related Data Managing the activity data regarding a particular user corresponds to creating a sub-graph of the complete graph of raw activity data we collect from logs, based on the information about the known settings of the user. To identify a user, we rely here on the settings used to realise the activity. Each trace of activity is realised through a setting (linked to the trace by the hasSetting ontology property). Knowing the settings of a user therefore allows us to list the traces that correspond to this particular user through a simple query. Using a SPARQL Construct query, we can create a model, i.e. an RDF graph, that contains all the information related to the user’s activity on the considered websites: PREFIX tr: PREFIX actor: construct { ?trace ?p ?x. ?x ?p2 ?x2. ?x2 ?p3 ?x3. ?x3 ?p4 ?x4 } where{ actor:knownSetting ?set. ?trace tr:hasSetting ?set. ?trace ?p ?x. OPTIONAL {{?x ?p2 ?x2}. OPTIONAL {{?x2 ?p3 ?x3}. OPTIONAL {?x3 ?p4 ?x4}}} } The results of this query correspond to all the traces of activities in the col- lected data that have been realised through known settings of the user http://uciad.info/actor/mathieu, as well as the surrounding information. These data, materialised as an RDF graph, can therefore be considered on its own, as a user-centric view on the activity data available through integrated logs. 4.3 Managing Access Right over Semantic Data We store, manipulate and reason over activity data using Semantic Web tech- nologies, namely RDF, a triple store with inference capabilities and SPARQL for querying. As part of the UCIAD platform, we needed a mechanism to restrict the queries being sent to only the part of the data that the current user has access to: his/her own subgraph of activity data. Unfortunately, most current triple stores, and especially the one we are em- ploying, do not provide sufficiently fine-grained access control mechanisms, al- lowing to associate sub-graphs to particular users. We therefore implemented our own mechanism, which can be seen as a generic recipe for access control over activity data. Fig. 3. Overview of the mechanism for access right to data in a SPARQL endpoint. The idea, as depicted in Figure 3, is that the actual SPARQL endpoint giving access to all the data for all the users is being hidden using standard security measures so that it can only be accessed by our own system. We then implement a “proxy SPARQL endpoint” that can handle basic HTTP authentication. When receiving a query, this proxy endpoint will check the credential of the user and see what sub-graphs the user has access to, so that it can modify the query to restrict it to these sub-graphs only (using the FROM clause in SPARQL). It can then send the query to the real, hidden SPARQL endpoint and forward the results back to the user. While this mechanism is relatively simple, it offers an appropriate level of flexibility, allowing to define arbitrary subgraphs and user definitions as a model for access control. 5 Interpreting and Analysing Activity Data through Lightweight Ontology Reasoning Here, we want to use the ontologies we have created, and extend them, so that they can support the interpretation and analysis of the extracted activity data. What we want to achieve is, through providing ontological definitions of differ- ent types of activities and resources, to be able to characterise different types of traces and classify them as evidences of particular activities happening. The first step in realising such inferences is to characterise the resources over which activities are realised – in our case, websites and webpages. Our ontolo- gies define a webpage as a document that can be part of a webpage collection, and a website as a particular type of webpage collection. As part of setting up the UCIAD platform, we declare in the RDF model the different collections and websites that are present on the considered server, as well as the URL pat- terns that make it possible to recognise webpages as parts of these websites and collections. These URL patterns are expressed as regular expressions and an au- tomatic process is applied to declare triples of the form page1 isP artOf website1 or page2 isP artOf collection1 when the URLs of page1 and page2 match the patterns of website1 and collection1 respectively. The base ontologies we have defined can then be extended to represent par- ticular categories of resources, depending on their properties. We for example declare a particular website as a W iki. We can also declare a webpage collec- tion that corresponds to RSS feeds, using a particular URL pattern, and use an ontology expression to declare the class of W ikiU pdate as the set of webpages which are both part of a W iki and part of the RSSF eed collection, i.e., in the OWL abstract syntax Class(WikiUpdateFeed complete intersectionOf(Webpage restriction(isPartOf someValuesFrom(RSSFeed)) restriction(isPartOf someValuesFrom(Wiki)))) We can similarly define the activity of checking and federating updates from a wiki by creating the class of traces of activities (requests) realised on a W ikiU pdateF eed using an RSSClient as user agent. Another example would be defining the class ExecutingASP ARQLQuery as the one of sending a request to a page of the type SP ARQLEndpoint using a query parameter. Such definitions can be added to the repository, which, using its inference capability, will derive that certain pages are W ikiU pdateF eeds, and certain activities correspond to ExecutingASP ARQLQuery without this information being directly provided in the data, or the rule to derive it being hard-coded in the system. We can therefore engage in an incremental construction of an ontology characterising websites and activities generally, in the context of a particular system, or in the context of a particular user. 6 Implementation: the UCIAD Platform We realised the UCIAD platform as a demonstrator, where a user can register to the platform with some setting details and browse his or her activity data as they appear on several Open University websites (mostly, an internal wiki system and the Open University’s linked data platform – data.open.ac.uk).8 The current “in development” version of the platform implements and demon- strates the following components described above: User management: As the user registers into the UCIAD platform, his current setting is automatically detected, and other settings (other browsers) that are likely to be associated to him or her are also included. As the user registers, the settings are associated to his account and the activity data realised through these settings are extracted. Extracting user-centric activity data: As described in Section 4.2, the set- tings associated with the user are used to extract the activity data around this particular user, creating a sub-graph corresponding to his or her activity. Ontologies to make sense of activity data: The ontologies are used in struc- turing the data according to a common schema and to provide a base to ho- mogeneously query data coming from different systems. As discussed above, they can also be extended (specified) so that different categories of activities and resources can be represented, and reasoned upon. Ontological reasoning for analysis: Activity data is organised according to different categories (traces, webpages, websites, settings, etc.) coming from the base ontologies, but also according to classes of activities, resources, etc. that have been specially added to cover the websites and the particular user in this case (see Section 5). Here, we extended the ontologies in order to include definitions of activities relevant to the use of a wiki and a data platform. For example, we define “Executing a SPARQL Query” as an ac- tivity that takes place on a SPARQL endpoint with a “query” parameter, or “Checking Wiki Updates” as an activity on a Wiki page that is realised through an RSS client. Browsing data according to ontologies: We rely on an homemade “browser” that we use in a number of projects and that can inspect ontology classes and members of these classes, generating graphs and simple statistics for these classes and members. 7 Discussion and Future Work While the UCIAD platform provides an interesting first attempt at demonstrat- ing the feasibility of user-centric activity data based on semantic technologies, a number of challenges are left to be considered before such technologies could be deployed in realistic settings to provide Web users with an appropriate view on 8 see http://uciad.info/ub/2011/08/final-post-putting-things-together-with-a-demo/ for a description and a video of this demonstrator. their own activity data. The first, technical challenge is scalability. Indeed, triple stores such as OWLIM can now handle very large amounts of data (see the benchmark tests in [3, 4]). However, activity data in the form of traces from logs are enormous. Indeed, an average Web server from the Open University would serve a few million requests per month. Each request (summarised in one line in the logs) is associated with a number of different pieces of information that re-factor in terms of our ontolo- gies, concerning the actor (IP, agent), the resource (URL, website it is attached to, server), the response (code, size) and other elements (time, referrer). We can obtain between 20 and 50 triples per request. This leads us to amounts of data in the order of 100 million triples per month per server (each server can host many websites). In theory, OWLIM should cope with such a scale, even if we consider several servers over several months. However, the data we are uploading to OWLIM is complex, and has a refined structure. Some objects (user settings, URLs) would appear very connected, while others would only appear in one re- quest, and share only a few connections. From our experience, it is not only the number of triples that should be considered, but also the number of objects. A graph where each object is only associated with 1 other object through 1 triple might be a lot more difficult to process than one with as many triples, but shared amongst significantly less nodes (see [5]). This scale issue is amplified when inference mechanisms are applied. OWLIM handles inferences at loading times. This means that not only the number of triples uploaded onto the store are multiplied through inferences, but also that immensely more resources are required at the time of loading these triples, de- pending not only on the size of what is uploaded, but also on its complexity (and, as mentioned above, our data is complex) and on the size of what is al- ready stored. Originally, our approach was to have one store holding everything with inferences, and to extract from this store data for each user. We changed this approach to one were the store that keeps the entire dataset extracted from logs does not make use of inference mechanisms. Data extracted for each user are then transferred into another (necessarily smaller) store for which inferences apply. A less technical challenge for approaches to activity data relying on a user- centric perspective is the identification of user-related data and their distribu- tion. Indeed, as we explained in Section 4, we identify users based on a number of indicators detected at the time the user is registering and logging in. These indicators are far from being 100% accurate. Other types of systems can cope with inaccuracy as they are generally eliminated or reduced when the data is being aggregated. However, here, providing activity data to the wrong user could create critical privacy issues that need to be considered. More robust security mechanisms, as well as more accurate user identification mechanisms (using for example the cookies employed by Web tracking systems) would need to be de- ployed. Another crucial element concerns the distribution of the data. One of the important aspects of user-centric data is that the user should be able to export his or her own data, in order to exploit them for their own benefit. The ownership of the data is not however very clear in this case. It is data collected and delivered by our systems, but that are produced out of the activities of the user. We believe that in this case, a particular type of license is needed, which would give control to the user on the distribution of their own data, but without opening it completely. References 1. d’Aquin, M., Rowe, M., Motta, E.: Self-tracking on the web: Why and how. In: W3C Workshop on Web Tracking and User Privacy. (2011) 2. d’Aquin, M., Elahi, S., Motta, E.: Personal monitoring of web information exchange: Towards web lifelogging. In: Web Science 2010, WebSci10: Extending the Frontiers of Society On-Line, poster. (2010) 3. Guo, Y., Pan, Z., Heflin, J.: LUBM: A benchmark for OWL knowledge base systems. Journal of Web Semantics 3(2) (2005) 4. Bizer, C., Schultz, A.: The berlin sparql benchmark. International Journal on Semantic Web and Information Systems 5(2) (2009) 5. Duan, S., Kementsietsidis, A., Srinivas, K., Udrea, O.: Apples and oranges: A comparison of RDF benchmarks and real RDF datasets. In: SIGMOD. (2011)