          Privacy Implications of Online
    Consumer-Activity Data: An Empirical Study

                                  Keerthi Thomas

                               The Open University,
                          Walton Hall, Milton Keynes, UK
                           keerthi.thomas@open.ac.uk



      Abstract. Web users allow online organisations to collect vast amounts
      of information about them and their activities, often in exchange for
      free services. But there is now a growing expectation that such user
      data, generally called consumer data, should be given back to the users
      who helped create it, so that it can be exploited for their benefit.
      There is also a realisation that such a release of users’ data would
      promote greater transparency and accountability in the organisations
      collecting it. As with any process where data is published, there is a
      risk of complex privacy issues arising. In this paper, we focus on what
      we believe is a significant and yet under-explored data type: consumer-
      activity data, i.e., data (Web access logs) generated
      by an organisation which tracks the usage and interactions of its online
      services and resources. We conducted an exploratory qualitative study
      of 12 users to investigate the consequences of making such consumer-
      activity data available to the users who generated it, especially the
      privacy challenges this raises, both from an organisation’s point of
      view and from that of the individuals whose online activities are being
      tracked. This was achieved by exposing the study’s participants to a
      ‘personal analytics’ dashboard which provided access to information on
      their usage of, and interactions with, the online systems of a large
      educational organisation (The Open University in the UK). The findings
      from our study show that, although there are potential benefits for the
      users, there are several privacy risks which are yet to be addressed.

      Keywords: Consumer-activity data, privacy


1    Introduction
Organisations, both those which model their business around free web-based
services (e.g. Google’s search engine, Facebook’s social networking service, etc.)
and more traditional institutions with a significant web presence, passively
collect vast amounts of data about their users and exploit these data in a variety
of ways, benefiting either indirectly by improving the user experience or directly
through advertising, collaborative filtering, etc. From the users’ perspective, this
self-serving exploitation of their data is perceived to benefit only one party;
new initiatives have therefore emerged which are pushing towards
greater transparency and openness, beyond the basic data protection practices.
In the UK, for example, the government is leading a consumer empowerment
strategy, in collaboration with leading businesses and consumer groups, to give
individuals more access to, and control over, the data companies hold about
them [18]. While commercial sector players such as supermarkets, mobile net-
work operators, banks and other industries are being persuaded to publish their
consumer data, some of the giants of online services such as Google and Facebook
are leading by example. Google in particular has implemented its data liberation
principle [25] where users are allowed to download any data they create using
Google’s online services. Similarly, Facebook also allows its users to download
an archive of their online interactions as an ‘activity log’ [9]. While such top-
down approaches to releasing consumer data are evolving quickly and gaining
more prominence, there are several important questions from a user’s perspec-
tive which remain unanswered: Are these data really beneficial to the users, and
what do they think of them? If there are user-specific benefits, what are they likely
to be (beyond sample use-cases)? By publishing these datasets, do users
and organisations face risks to their privacy and security, and if so, what are they?
What is the impact on privacy policies? In this paper, our goal is to provide
answers to some of these questions, focusing in particular on a significant, and
yet under-explored part of consumer data: Consumer-activity data, which here
refers to all data produced (Web access logs) as a consequence of a user’s in-
teraction with the websites or online software systems of a given organisation.
To achieve our objectives, we exposed real users from a large organisation with
complex web applications to potential mechanisms through which they could
not only access but also visualise their own consumer-activity data. To this
end, we designed a user study at The Open University (OU) relying on data col-
lected from the logs of the institution’s multiple online systems which captured
the online activities of a group of 12 users (researchers, students, tutors, admin
staff), for about 1 month, and provided them with access to such data through
a dedicated ‘personal analytics’ tool. We report here on our findings, namely
the benefits of releasing consumer-activity data, the privacy risks applicable to
both users and organisations, and the impact consumer-activity data has on the
privacy policies which govern its use.


2   Background and Related work

In educational institutions, many systems store data about the actions of stu-
dents, teachers and researchers. In this regard, JISC [1] mentions two types of
data - (i) user activity data - a record of a user’s actions on a website or software
system or other relevant institutional service, and (ii) attention data - the record
of what a user has viewed on a website or software system or other relevant insti-
tutional service. Both these data are similar to consumer-activity data, in that
they are created from the user’s interaction with an online system or resource.
However, the difference is that consumer-activity data encompasses both the
user’s ‘action’ and ‘attention’ on an online system. In addition, unlike JISC,
whose focus is on educational institutions, consumer-activity data refers uni-
versally to all types of organisations – academic, commercial and governmental.
    In the UK, organisations whose software systems collect data are obliged to
provide protection under the Data Protection Act (DPA) if the data they collect
relate to and contain information about an individual’s identifiable attributes
(e.g. name, date of birth, etc.). ICO [14] highlights such data as being of two
types – personal data and sensitive personal data. Personal data relates to a
living individual who can be identified from those data, while sensitive personal
data refers to personal data pertaining to racial or ethnic origin, political opin-
ions, religious beliefs, memberships of organisations (e.g. trade union), physical
or mental health or condition, sexual life, convictions, etc. By nature, consumer-
activity data are associated with an individual and therefore contain informa-
tion that identifies the user who created them. The midata project [18] refers to
consumer-activity data as being personal data. Since the data profile from each
software system can differ depending on the type of functionality it supports,
it is difficult to state beforehand what types of privacy issues one might
encounter and what types of privacy protection users might need.
    Apart
from the DPA, the current trend of publishing consumer-activity data is driven
by at least two principles stated in the OECD guidelines [19], one of them being
the Openness principle which states that organisations should have a policy of
openness about “developments, practices and policies with respect to personal
data”. In other words, organisations are expected to be transparent in the way
they collect and process personal data. The OECD’s Individual Participa-
tion principle further states that individuals have a right to visibility of and
access to their data and, if required, should be allowed to rectify, correct, complete
or erase the data held by others. This also ties in with the Integrity/Security
and Access/Participation principles of the FIP [10], which mandate organisations
to make personal data visible to individuals so that their accuracy and complete-
ness can be contested if necessary.
    As a prolific collector of consumer-activity data, Google has been exploiting
these data not only to power its Web analytics
service [12] but also its targeted advertising. While some have evaluated the
role of Google Analytics in improving the usability of e-commerce websites [13]
and library services on the Web [26], others have shown how these analytics can
be extended to measure and improve the performance of websites [21, 22]. Other
parallel research efforts such as [7] and [8] concentrated on independently collect-
ing and visualising web usage as user-centric analytics purely from a technical
and architectural point of view, with one exception [6] where privacy and trust are
briefly analysed. Focusing on the distributed leakage of personal data from users’
interactions across a wide variety of websites, some have demonstrated how “pri-
vacy footprints” can be measured and analysed [15]. Qualitative research such
as [16] captured Facebook activities to analyse their mobile privacy implications;
in another similar study [17], data from location-tracking activities were used to
elicit privacy concerns; in both works the emphasis was on the effect of the user’s
mobility. Other works such as [2] have investigated how user interaction can be
tracked in detail using standard web technologies, but their focus is
mainly on the usability evaluation of web applications outside the lab. Carl et
al. [3] record a computer user’s keystrokes and eye movements, which they refer to as
“user activity data” in their cognitive research on natural language translators.
Unlike these, our study focuses on the implications of releasing consumer-activity
data back to the users. For brevity, from now on we use the term CA data
to refer to consumer-activity data.


3     Methodology

Following the trends and background described in the previous sections, the
user study presented in this paper started with the assumption that, eventually,
mechanisms will be put in place by organisations to give users access to their
CA data. We therefore decided to consider our own organisation, the OU, as a
testbed. From this, two main questions emerged:

 1. If consumer-activity data were made available to users of an organisation
    such as the OU, what would be its benefits and uses for individual users?
 2. What would be the implications of deploying such a mechanism in terms of
    privacy risks and policies, both for the users and the organisation?

    We investigated the above research questions through a qualitative study
which involved exposing participants to a ‘prototype’ tool to access and visualise
their CA data collected from the OU, and then using a combination of personal
interviews, online questionnaires and a focus group to collect reactions, opinions
and concerns regarding the potential uses and implications such a tool might
have if properly deployed. The study itself was therefore divided into four phases:
data collection, personal interviews, focus group and analysis.


3.1   Overview of the data collection process

The Open University is the largest university in the UK (with approx. 250,000
enrolled students) and is based on distance learning, which means students study
at a distance and interact with university staff (associate lecturers, course
team, administrators, IT help-desk, library) mainly through online systems. Thus,
the information architecture of the OU consists of a large variety of systems
which have web interfaces, produce logs and are centralised within the IT de-
partment of the university.
    During the data collection phase, participants were recruited through adverts
on the intranets, mailing-lists and by word-of-mouth. Upon receiving their written
consent, the IT team of the OU was given each participant’s identifying details
(their computer user-name), to enable them to extract the CA data
from the server logs of different online systems for a period of 28 days.
    Even though the OU was collecting users’ CA data on various online systems,
there was neither a requirement nor a mandate to release these data
to users. Significant effort was therefore required on our part to make all the
stakeholders (IT security department, data protection officer, ethics committee)
understand the process of the user study, so that IT could provide us with
relevant data in a usable format.
    The core of our methodology depended on the UCIAD technology plat-
form [7] which linked and integrated heterogeneous data from several online
systems within the OU. On top of this platform, a set of GUI tools was devel-
oped in the form of a personal analytics dashboard in which a user’s
activities were displayed as graphical visualisations.
    During the personal interviews, the participants made use of the personal
analytics dashboard to view their past activities on various online systems of the
OU. Using these visuals as a trigger, the participants were asked questions on
how they felt with regard to the data licensing, privacy and data protection policies
which apply to such information. The personal interview was unstructured and
open-ended because we wanted to explore whether participants wanted to use their CA
data in other creative ways not envisaged by us. Two weeks after the interview, a
short online questionnaire (with 3 questions) was sent out to all the participants
who had already given their personal interview; this was mainly
to gauge any change in their behaviour with regard to the use of the OU’s online
systems. The topic we were investigating had the potential to produce divergent
and conflicting views; we therefore included a focus group, designed
so that participants could resolve their views in relation to those of others. To
make the focus group interesting and challenging, its design was informed by the
initial results we obtained from the personal interviews. The participants were
compensated for their time with a £30 Amazon voucher. The qualitative data
from the personal interview was analysed using Grounded Theory [4] techniques
and then triangulated with the results from both the online questionnaire and the
focus group. The findings of the study are described and discussed in the sections
that follow.

3.2   Participants and data
The user study had 12 participants who had enrolled on a voluntary basis and
were regular users of the OU’s online systems. Participants broadly fell into four
categories of users: post-graduate students (3), academic staff (3), academic-
related staff (3) and administrative/management staff (3). The post-graduate
students were PhD students located on the main campus of the uni-
versity. The academic staff included both lecturers and researchers in differ-
ent departments (i.e., Computer Science and Arts). The academic-related staff
mainly consisted of technical staff who worked on IT development projects. Ad-
ministrative and management staff were from different departments, including
the university library and research school. This selection of participants was
intended to reflect the different types of usage of the OU’s online systems. How-
ever, compromises had to be made on the male-to-female ratio: there were
4 female and 8 male participants. The participants’ ages approximately
ranged from 25 to 55 years. For each participant, we collected infor-
mation regarding their usage of OU websites through web server logs associated
with these different systems. This required filtering these logs to keep only the
data related to the participants of the study. This data collection mechanism was
run over a period of four weeks (or 28 days), leading to the 12 datasets (one per
participant) that included information about access and requests to OU web-
sites. This information is encoded in a format similar to that of Apache logs:
a text file in which each line corresponds to a request to a Web server, including
pieces of information such as the user-name, the client IP address, the date and
time of the request, the requested URL and the user agent string.

    This information was collected from 9 different servers, which corresponded
to the virtual learning environment (6 servers), the intranet, the public website
and the student services website of the OU. As expected, information collected
for different participants varied widely, depending on their roles.
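
    To make the filtering step described above concrete, the sketch below (in
Python) shows how per-participant datasets could be extracted from combined-
format access logs. It is purely illustrative: the file names, user-names and
exact field layout are our assumptions, since the precise log schema used at
the OU is not reproduced here.

import re
from collections import defaultdict

# Apache combined-format line, with the authenticated user in the third
# field (an approximation of the logs described above).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# User-names of the consenting participants (hypothetical values).
PARTICIPANTS = {"user01", "user02", "user03"}

def split_by_participant(log_paths):
    """Keep only the requests made by consenting participants,
    grouped into one dataset per participant user-name."""
    datasets = defaultdict(list)
    for path in log_paths:
        with open(path, encoding="utf-8") as log:
            for line in log:
                match = LOG_PATTERN.match(line)
                if match and match.group("user") in PARTICIPANTS:
                    datasets[match.group("user")].append(match.groupdict())
    return datasets

# Hypothetical file names standing in for the 9 server logs.
datasets = split_by_participant(["intranet.log", "vle-server1.log"])
for user, entries in datasets.items():
    print(user, len(entries), "requests over the collection period")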


3.3     Technology platform

To realise the study, a data processing and visualisation platform was required:
first, to process and integrate the data obtained from server logs into easily
interpretable and exploitable CA datasets; and secondly, to create an interactive
interface, i.e. a personal analytics dashboard, which allowed users to visualise
and interact with their own CA data. To process and integrate server logs into
CA data, we reused the principles and tools developed as part of
the UCIAD project (see [7] for a summary), which had facilities to:

    – Convert and integrate the data from their Web server log format into RDF [23],
      using the schema provided by the UCIAD ontologies.
    – Create ontology level definitions for different types of resources and activities
      which are then used to process and categorise the traces of activities for the
      different users.
 – Realise additional ad-hoc processing of the data, to improve interpretation
   and visualisation. For example, the geographic location of the user was derived
   by passing the IP address to external Web services/APIs. Similarly, human-
      readable labels for the user agents (e.g., “Chrome 8”) were extracted from
      complex user agent strings found in the logs.
    – Create a data endpoint for each of the participants based on the generated
      and processed RDF datasets. In this case, we used the Fuseki triple store [11],
      creating one separate data repository for each participant.
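
    As an illustration of the first and third steps above, the following sketch
converts one parsed log entry into RDF triples. It is a simplification under
stated assumptions: the uciad.info namespace, class and property names are
hypothetical stand-ins for the actual UCIAD ontologies described in [7], and
the user-agent labelling is a crude approximation of the real extraction
(geolocation enrichment would be added similarly, by looking up the IP address
in an external service).

from rdflib import Graph, Literal, Namespace, RDF, URIRef

# Hypothetical namespace standing in for the UCIAD ontologies.
UCIAD = Namespace("http://uciad.info/ontology/")

def label_for(agent_string):
    # Crude stand-in for real user-agent parsing (e.g. "Chrome 8").
    return "Chrome" if "Chrome" in agent_string else "Other"

def entry_to_rdf(graph, entry, entry_id):
    """Describe one parsed log entry as an activity trace in RDF."""
    activity = URIRef(f"http://uciad.info/trace/{entry_id}")
    graph.add((activity, RDF.type, UCIAD.Activity))
    graph.add((activity, UCIAD.requestedResource, Literal(entry["request"])))
    graph.add((activity, UCIAD.timestamp, Literal(entry["time"])))
    graph.add((activity, UCIAD.userAgentLabel, Literal(label_for(entry["agent"]))))
    return graph

g = entry_to_rdf(Graph(), {"request": "GET /library HTTP/1.1",
                           "time": "12/Mar/2012:10:15:00 +0000",
                           "agent": "Mozilla/5.0 ... Chrome/8.0"}, 1)
print(g.serialize(format="turtle"))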

    Once the data was processed and made available through data endpoints,
the personal analytics dashboard from UCIAD provided the end-user interface
to visualise and access these data. Normally, the logs from Web servers are used
to produce web analytics (i.e. website usage statistics), and many users are
familiar with the tools and interfaces associated with such analytics. We were
thus motivated to design the personal analytics dashboard so that it used similar
visualisation methods, but showed (in an aggregated form) information about
the user’s visits to various websites. These visualisations include information
about the types of activities performed on online systems, the resources accessed,
locations of the user at the time of usage, the browsers and operating systems
used, the time, etc. Figure 1 shows a screenshot of the dashboard. It is important
to also mention that these visualisations are interactive, as they allow the user
to define filters on the activities (e.g., only show activities realised at a certain
location, or at a given time) by clicking on the corresponding chart elements.




      Fig. 1. Screenshot of the UCIAD personal analytics dashboard for a user.
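
    Behind such interactive filters, the dashboard queries the participant’s own
data endpoint. The sketch below shows roughly what such a query could look
like against a local Fuseki server, reusing the hypothetical schema of the earlier
sketch; the endpoint URL, dataset name and property names are all assumptions
rather than the actual UCIAD implementation.

from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical per-participant Fuseki endpoint.
endpoint = SPARQLWrapper("http://localhost:3030/participant01/query")
endpoint.setQuery("""
PREFIX uciad: <http://uciad.info/ontology/>
SELECT ?resource ?time WHERE {
  ?activity a uciad:Activity ;
            uciad:requestedResource ?resource ;
            uciad:timestamp ?time ;
            uciad:location "Milton Keynes" .  # filter: one location only
}
ORDER BY ?time
""")
endpoint.setReturnFormat(JSON)
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["time"]["value"], row["resource"]["value"])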

3.4   Interviews based on tool usage
The core of the user study was the personal interview conducted with each par-
ticipant to explore their reactions and views on consumer-activity data, firstly
as a potential user of the system and secondly in relation to anything they
might discover, either interesting or worrying, as they used the tool. The
interview started with a small introduction to the study, and initial questions
about the participant’s background, including their view on how they used OU’s
online services and their knowledge of web analytics. After this, each participant
was given access to a computer running the UCIAD personal analytics dash-
board, which accessed the data endpoint specifically created for them. As they
used and explored the tool, they were asked to answer a number of questions
related to the following topics:
    – Usage of consumer-activity data (“Did you find anything surprising/interesting
      in the data?”, “Would you like to be given access to this data?”)
    – Data gathering issues (“Were you aware such data was being collected?”,
      “Is there anything in what you see causing you concern?”)
    – Activity data policies (“Who owns the data and how should it be handled?”)
    The interview was conducted by two project team members: while one mem-
ber asked the questions (guided/prompted by the usage and exploration of the
personal analytics dashboard), the other team member took notes of the inter-
view. Interviews generally lasted between 45 minutes and 2 hours. Each interview
was audio recorded and stored in encrypted containers (to protect privacy) and
later transcribed for analysis (the transcripts were anonymised before use).

3.5     Online questionnaire and focus group
An online questionnaire was sent to participants two weeks after the initial in-
terviews in order to check whether additional thoughts and reactions had
emerged since. These questionnaires did not lead to additional in-
sight, except to confirm that the participants had not changed either their views
or their behaviour with regards to how they perceived and used OU’s online
systems. The last phase of the study was the focus group. We included a focus
group because the participants came from diverse backgrounds and roles, and
there was value in letting them debate their positions and viewpoints with oth-
ers in a controlled setting such as a group discussion. To be effective and yet
finish within the allotted time of 90 minutes, the focus group was constructed
around a specific task: the group members had to collaborate with each other
and contribute towards writing a business case supporting the deployment of
a CA data service at the OU. This business case included two main sections:
(i) benefits of CA data and (ii) obstacles to the deployment of CA data services
(which indirectly pointed to privacy and security issues). During the focus group
discussion, participants were asked to identify points to add to both these sec-
tions, and to react to the points made by others. Each point raised was recorded
by a member of the UCIAD team on a common document (the “draft business
case”), which was projected on a screen for all the participants to see, so that
the notes could be corrected and validated collectively as they were made.

4      Findings
In this section, we summarise the main results of the study, as a set of general
findings, mainly regarding the benefits, privacy risks and challenges relating to
potential users of CA data within an organisation such as the OU. These obser-
vations were obtained through analysing the transcripts of the recordings made
during the individual interviews, together with the results of the discussions
summarised from the focus group.

4.1   Potential benefits in releasing CA data

As previously described, CA data is a by-product of the user’s interaction with
the organisation’s online systems. Therefore, one of the fundamental questions
the user study aimed to answer was whether users would be sufficiently interested
to request a copy of the data and put it to use. It was interesting to note that
the participants came up with several innovative suggestions as to how they
might use their activity data to better themselves:
    Self reflection: Some participants of the study saw value in simply being
able to reflect on what they do daily; it was more of a self-assurance tool to
indicate whether or not they were on track with their actions: “yeah...it is for self
reflection, it can show you if you are on track...that you are not deluding yourself
as to what you are actually working on” [P3]. This notion of self-reflection on
one’s own activities is generally related to that of lifelogging [20] (see in
particular [5], which discusses this idea in relation to web interactions).
    Self improvement: Another interesting benefit of CA data the participants
suggested was improving their use of online resources. This is closely related
to self-reflection, and arises from what they see as “inefficiencies” in
their use of the resources. For example, users find it frustrating not to have
bookmarked certain pages they use quite frequently and to have to search for
them: “I kind of realised this is not very effective and every time I need that
information about the poster...[I want to] bookmark it, you know...to myself and
then I never do...but when you look at something like this...you go like...put some
effort and bookmark it” [P5].
    Trace anomalies: Much as application logs have long served
for analysing and tracing error conditions in software, users found
CA data useful for tracing anomalies in their work patterns. For example, they
could trace their online activities back to specific websites whose security had
been compromised: “I can think of an impact around sort of data security and
personal security and it being able to flag up... doing something on a browser that
might not be so secure as you might want or looking at websites that are actually
security risks...to actually be able to go back and track back to say actually did
I use that website at that particular time...that’s the type of thing that will be
useful” [P7].
    Promote transparency: In the context of the organisation (the Open Uni-
versity), participants felt that the very act of releasing the CA data to the users
would significantly improve the organisation’s reputation: “if you have a view of
the data who is holding, I think it makes...it reassures the users I suppose, this
is what the OU [organisation] knows, this is what you know...that in a way quite
equal” [P10]. Participants also felt that CA data can provide transparency at
a personal level; for example, users are able to justify or defend their actions
using the CA data as evidence: “as long as I am doing my job I don’t think its
a problem...but I suppose...I am trying to think...my time in Berlin is accounted
for so I am quite happy that people know that I was in Berlin” [P4] and “I am
not...you know...I am not doing anything that I wouldn’t be able to stand up and
defend in court if I had to” [P9].
    Although these benefits may not be extraordinary in themselves, they show
that users, when given an opportunity to access their CA data, are likely to
find innovative and creative means to benefit themselves. Given that users are
likely to exploit their CA data, we now focus on the potential privacy and security
risks that both users and the organisation are likely to face, especially with
regard to key questions such as: (1) In what ways could the user’s privacy be
harmed by this data (e.g. in case of data leakage)? (2) Can the organisation
suffer any privacy harm (e.g. if the user voluntarily shares the data on social
media)? (3) What must be done to avoid the harms in (1) and (2)?

4.2   Privacy risks: users and organisation
First, we describe the privacy risks. While these privacy risks may not be com-
pletely new, our findings underline and confirm that, even in a trusted producer-
consumer relationship (as in this case), such privacy threats exist and must
be addressed.
    Image distortion: Distortion seems to be a major concern. Distortion refers
to disseminating false or misleading information about an individual [24]. The
user study showed three contributors to distortion: incompleteness, inaccuracy
and incorrectness of data. In the interviews, one of the most commonly expressed
concerns related to the incompleteness of CA data. Some participants felt the in-
dicated usage was skewed and incorrect because it excluded data from several
other online sources, and the resulting incomplete picture presented them in
a negative light. For example, one user indicated that missing CA data could
be misunderstood as them “not working”, when in fact their activities had been
captured in systems not included in the study: “I would like to be able
to put in my wider browsing history and quite like as well to pull in my calendar
schedule of meetings stuff like that and just see how much time I actually spent
because like I said there times when I in a meeting or out somewhere else...so my
use of particular systems or web in general may not appear so intense as perhaps
it is... so I quite like to see that” [P9]. The users were worried that such incorrect
inferences made from incomplete data could hurt their reputation. The second
contributor to the distortion threat was the inaccuracy of data; a common exam-
ple related to the location of the user displayed on the map for each activity. The
location of the user was derived from the IP address of the computer used in the
interaction, but the location derived through this method was often misleading
because the service which interpreted the IP address was grossly inaccurate in
specific cases. In one particular instance, the location pointed to a city
the user had never visited, adding to the fear that, without the user’s
input, the inferences could be wrong and harmful. The third contributing factor
to the distortion threat is incorrect metadata. More precisely, it refers to the in-
correct classification and categorisation used in producing the analytics. When
there is a mismatch between how the user and the system have classified an ac-
tivity, it can have negative implications for the user: “it [category/classification
of activity] would have to be interpreted very carefully. So obviously it might be
somebody, some social scientist who might be researching pornographic sites and
its impact on society” [P5]; in this case, the social scientist could be mistaken
for someone accessing adult material during office hours.
    Unwanted disclosure: Disclosure involves the revelation of an individual’s
true information to others, impacting their reputation [24]. Even if the data is
distorted, its impact is limited as long as it is not shared with others: “it’s
fine because I am looking at it [data] and I know how to interpret it, if somebody
else looked at this...like you didn’t know why I spent so much time on the eBook
thing...yeah, I don’t like the idea that people can simply look at it and interpret it
in the way it wasn’t correct” [P3]. It is understandable that users are worried that
incorrect inferences could be viewed by individuals within the organisation
or elsewhere who can make decisions detrimental to the user’s interests:
“it can be quite misleading judging from this...this isn’t my work, this isn’t in
my working day and if people are going to be basically making decisions based
on this...yeah... it makes me nervous” [P4]. The point here is that the users
understand the context of their work and their activities and expect to be able to
interpret them. The underlying concern, however, is that if this data is accessible
to others (within the organisation or outside), they might misjudge the user’s
reputation based on the partial data.
    Re-identification: Identification refers to linking information to particular
individuals [24]. Whether the inferences are accurate or not, users generally do not
seem overly concerned if the CA data is anonymised, but it does cause
concern if they can be identified or the inferences can be associated with them
on an individual basis: “I don’t mind if it is aggregating vast data...its perfectly
sensible in my book but its when they are drilling down on into individual usage
or even individual departments usage then I’ll get a little bit more nervous I
think.” [P4]. Even when the data is anonymised, there is always a possibility that
users might be re-identified through triangulation with other datasets.
    Opaqueness of passive data collection: Exclusion is the failure to provide in-
dividuals with notice of and input into their records [24]. Currently, organisations
provide privacy notices when personal data is being collected, but these are usually
in general terms and, as such, organisations are not obliged to specifically mention the
nature of CA data their online systems generate. Although organisations may
be highly trusted and may even be seen in a favourable light, there is an implicit
expectation that the organisation will be as open and transparent as possible:
“in principle I don’t have a problem but I guess just thinking about it...yeah...who
is collecting the information for what purpose...I think I’d want to be made aware
of that...before the information is collected not afterwards...just nothing...apart
from being kept in the picture” [P4]. At this stage it is not clear whether users
will change their behaviour after being made aware of the data collected from their
online interactions, but they would nevertheless like to be notified.
    Insecurity from compromised infrastructure: Insecurity is the failure to
protect personal data [24]. If organisations decide to make CA data
available for users to download, special secure channels would need to
be opened. However, this still entails a risk that malicious attackers may obtain
a copy of the data: “yes, it would nice to have access to it, I’ll be interested but if
you set-up a system like that you open a gate, no matter how secure it is...there’s
a risk people who should not see...I mean other people will be trying, someone
might break-in or whatever” [P2]. Organisations spend a lot of resources on
securing their infrastructure, so the risk on their side may be lower than users’
ability to protect their copy of the CA data once it has been retrieved from the
organisation’s systems. In this respect, the user is far more vulnerable to attacks
and data leakage through the loss of storage devices or laptops.
    Breach of personal and organisation’s confidentiality: Breach of con-
fidentiality is the unauthorised revelation of confidential information resulting in a loss
of trust [24]. CA data not only contains references to the user but also holds
information about the organisation, which can be exploited by the user (e.g. a dis-
gruntled employee) to damage the organisation’s reputation: “what you wouldn’t
want to happen is say if you were the employer that we make this available to
people and suddenly they find a way to use to the detriment of the organisation.”
[P8], especially in cases where users decide to upload the content onto so-
cial networking services. This would breach the trust the organisation has in its
employees.
    3rd party exploitation of data: Secondary use is the use of data for
purposes unrelated to the purposes for which it is collected [24]. During the study,
many of the users were not aware of their data being collected by external third-
party services such as Google Analytics [12]. While Google Analytics aggregates
data from several users of particular online systems to provide usage statistics,
they also exploit the data they gather for their own purposes, to create user
profiles which can help them in targeting advertisements. In the study, few
of the users were aware of, or wanted, Google’s collection of their data, although
some of them, without being fully aware of the consequences, had added Google
Analytics to their own project web pages/websites. Third-party use of CA data
is also difficult to avoid, since users cannot be expected to develop tools to process
the raw CA data themselves and will be dependent on third-party services to analyse these
data. In such a scenario, the CA data which contains information provided by the
organisation about the user’s activities might be used for purposes not intended
by the user or the organisation. This therefore constitutes a potential privacy
risk for both of them.

4.3   Privacy policy models: ownership and licensing
At a higher level, organisational privacy policies govern how the data is collected,
processed and to what use it will be put. This relates to the ownership and licensing
of data: here, ownership is concerned with controlling who can change the access
permissions and usage rights on the data. The owner who holds the ownership
rights may allow other entities to access and use the data by issuing a license (i.e.
terms and conditions for use). The CA data may reside in the organisation’s or the
user’s domain (or both) and privacy risks apply differently depending on who
has control over the data. For an effective access control policy, the access rights
and ownership rights have to be clearly defined. However, in the study we found
no clear consensus among users on the ownership rights that should be applied to
CA data. Here we briefly describe the three potential models participants related
to:
    (1) The organisation owns the data: Currently, this is seen as the default
model. The organisation creates the data, therefore it retains the rights and
exploits the data as its own asset, and some users acknowledge and support
this position: “I think the OU owns it...its the University’s data, its information
about me but University compiled it and they must own it...as an individual I
don’t own the data, I am in that camp...some people may say they generated
the data, no [they] didn’t, the University happened to collect the data of my
usage, I didn’t create the data” [P8]. Since the organisations are constrained by
the DPA which specifies under what conditions the data can be used without
having to obtain consent from the user, delivering CA data would effectively
mean issuing a license to the user to not only access their own data but also
use it in ways they deem fit. The CA data contains details not only about the
user but also about the organisation. Even assuming organisations redact the CA data
before releasing it, there is always the risk that some confidential information
can be triangulated through other means. It is therefore in the interest of the
organisation to impose restrictive licensing conditions that make it difficult for
the CA data to be aggregated and used in ways that are detrimental to its
reputation. This of course raises an important question - what type of restrictions
should the organisation be allowed to impose on the released CA data, given
that users will want to exploit its benefits? The majority of study participants
indicated that they would expect the organisation to impose strong limitations
on the use of the data (e.g., it could not be shared or it could only be accessed
through the tools provided by the organisation).
    (2) The user owns the data: The second model represents a radical shift
from the default model described above. In this model, the participants’ sugges-
tion is for users to have full rights over the CA data once it is transferred to
the individual: “if the data is given back to me then I’d expect I am the owner
of it...that the people who had originally collected that data have passed the own-
ership to me to do what I like” [P7]. The participants reason that, since the
CA data represents personal information, it is only natural for the individual
to be in control of its use. One major implication of this approach is that
organisations will, by default, not have the right to access or exploit the
CA data in any form, unless explicitly licensed by the user. There-
fore, the important question here is - what type of restrictions should the user
be allowed to impose on the organisation’s use of CA data? Allowing users to
impose conditions could make it more complex for organisations to genuinely
use the CA data to improve their online services and, as a consequence, make
the collection of CA data irrelevant and counter-productive for organisations.
    Another reason to transfer ownership to the user is that the DPA indi-
cates that personal data may not be kept by an organisation for longer than
necessary for the intended purpose. Since organisations keep CA data for a short
period (at the OU, the data is kept for 7 consecutive days) and then delete it,
one possible approach would therefore be to transfer the data together with its
associated rights to the user after this retention period. From a practical perspec-
tive, this would also represent a convenient arrangement, as users could simply
‘opt-in’ for their CA data to be sent to them at the time it is deleted from the
organisation’s systems. Of course, this would still imply that the organisation
would lose control once the data has been transferred to the user, and “non
detrimental use/confidentiality” clauses might be attached to the delivery of CA
data.
    (3) Shared data ownership: The third suggestion is for a hybrid approach
where both the organisation and the user are considered co-owners of the CA data.
This means that both would have full access to the data, but would be prevented
from using it in ways that could be detrimental to the other. Study par-
ticipants proposed this to be an appropriate model: “it is shared data almost
isn’t it, like pay scales and salaries and things...that is OU data as well...as to
what they pay their staff, so its not just mine to do whatever I want...its like a
shared ownership” [P3]. The legal implementation of such an arrangement could
be rather complex, or on the contrary it might be simpler than anticipated.
For example, the organisation can comply with the DPA just as it does while
the CA data is in its domain, while at the same time users can be bound
by confidentiality agreements with the organisation, thus preventing them from
widely distributing these data. In any case, new terms and conditions will have
to be drawn up, so the important questions are: what are these joint terms and
conditions? Will there be any flexibility in negotiating them? How will conflicts
be resolved?
    We found that there was no clear consensus among the participants on any
of the ownership and licensing models described above. We also found that
participants were not in a position to articulate and elaborate the terms and
conditions that each of the three models might entail, such that both users and
organisations could mutually benefit from the use of CA data.


5    Discussion and conclusion

The user study had a sample size of 12 participants and as such was not ex-
pected to be representative of all CA data users; instead, the aim was to explore
and provide initial insights into the potential impact that releasing CA data will
have on both users and organisations. In this regard, our study showed
how users, when given an opportunity, can come up with ingenious ways to ex-
ploit and benefit from CA data. Further, the study uncovered several potential
privacy risks which are applicable to both the users and the organisation. Al-
though the discovered privacy risks are not new, the study showed that these
risks were dormant even in datasets generally not visible to the users. More im-
portantly, the privacy risks relate to both the user and the organisation, and
here we briefly discuss the impact of these findings with respect to legislation and
data protection principles. The DPA controls how organisations, businesses and the
government process data, and it applies primarily to personal data. This throws
open an important question: should CA data be considered as personal data?
In our study, CA data contained traces of the user’s identity, for example the
online system’s user-name, but this in itself was insufficient to link the data to any indi-
vidual within the organisation (without access to other directory systems
within the OU). Everything in the DPA pertaining to the data access rights of both
users and organisations then hinges on this classification. Considering
CA data as personal data can have far-reaching implications under the DPA:
(a) users will be allowed to demand a copy of their data, unlike the current sit-
uation where organisations are merely encouraged to release CA data voluntarily;
(b) organisations will be forced to make explicit how they process the data and
for what purposes it is used; and (c) organisations will be unable to allow third-
party use without explicit consent from the user. The current ambiguity attached
to the nature of CA data means organisations are not obliged to provide notice
and obtain the user’s consent regarding (b) and (c). For other implications, the
DPA in its current form appears inadequate. For example, CA data, in
addition to containing traces of the user’s identity, also contains confidential infor-
mation about the organisation’s internal structure; the question therefore arises of
what data access rights organisations have after the data has been transferred
to the user’s domain, which of course requires further debate and legislation.
One limitation of our study relates to generalisability - the OU, like many other
educational organisations, collects CA data on different systems to improve its
services, but we were not aware of it integrating these data and employing ana-
lytics to derive benefits from them to the extent done in the study.
     In conclusion, it is important not only to understand and anticipate the bene-
fits of releasing consumer-activity data but, crucially, also its privacy implications. To
this end, we conducted an exploratory qualitative study in the context of a large
educational institution in which users were exposed to a novel ‘personal analyt-
ics’ dashboard designed to provide access to, and visualisation of, their consumer-activity
data. From observing the reactions of the study participants, we not only un-
covered several privacy risks applicable to both the users and the organisation
but also highlighted several challenges relating to data ownership and licensing
models which must be addressed if consumer-activity data is to be released.
Acknowledgments
Thanks to Dr. Mathieu D’Aquin for his significant contribution in realising the
user study and his useful insights on post study analysis. The study was partly
achieved thanks to funding from JISC for the UCIAD (User Centric Integration
of Activity Data) project – see http://uciad.info for details.

References
1. Activity Data, JISC, UK, http://tinyurl.com/6lcb5xw
2. Atterer, R., Wnuk, M., Schmidt, A.: Knowing the user’s every move: user activity
   tracking for website usability evaluation and implicit interaction. In: Proceedings of
   the 15th international conference on World Wide Web, pp. 203-212. ACM, (2006)
3. Carl, M., Jakobsen, A.L., Jensen, K.T.H.: Modelling human translator behaviour
   with user-activity data. In: 5th Intl. Workshop on Natural Language Processing and
   Cognitive Science. (2008)
4. Corbin, J., Strauss, A.: Basics of Qualitative Research, Techniques and Procedures
   for Developing Grounded Theory. Sage Publications (2008)
5. d’Aquin, M., Elahi, S., Motta, E.: Semantic monitoring of personal Web activity
   to support the management of trust and privacy. In: Proceedings of the WebSci10:
   Extending the Frontiers of Society On-Line. (2010)
6. d’Aquin, M., Elahi, S., Motta, E.: Semantic monitoring of personal Web activity to
   support the management of trust and privacy. In: SPOT 2010: 2nd Workshop on
   Trust and Privacy on the Social and Semantic Web. (2010)
7. d’Aquin, M., Elahi, S., Motta, E.: Semantic technologies to support the user-centric
   analysis of activity data. In: SDoW 2011, Social Data on the Web: Workshop at the
   10th International Semantic Web Conference (ISWC) (2011)
8. Elahi, S., d’Aquin, M., Motta, E.: Who Wants a Piece of Me? Reconstructing a
   User Profile from Personal Web Activity Logs. In: International ESWC Workshop
   on Linking of User Profiles and Applications in the Social Semantic Web. (2010)
9. Explore Your Activity Log (Facebook), www.facebook.com/help/activitylog
10. Fair Information Practice Principles, www.ftc.gov/reports/privacy3/fairinfo.shtm
11. Fuseki: serving RDF data over HTTP, http://jena.apache.org/documentation/serving_data/index.html
12. Google Analytics, www.google.com/analytics
13. Hasan, L., Morris, A., Probets, S.: Using Google Analytics to Evaluate the Usability
   of E-Commerce Sites. In: Kurosu, M. (ed.) Human Centered Design, vol. 5619, pp.
   697-706. Springer Berlin Heidelberg (2009)
14. Key definitions of the Data Protection Act, Information Commissioner’s Office (ICO), UK, http://tinyurl.com/8f5wlbp
15. Krishnamurthy, B., Wills, C.E.: Generating a privacy footprint on the internet. In:
   Proceedings of the 6th ACM SIGCOMM conference on Internet measurement, pp.
   65-70. ACM, (2006)
16. Mancini, C., Thomas, K., Rogers, Y., Price, B.A., Jedrzejczyk, L., Bandara, A.K.,
   Joinson, A.N., Nuseibeh, B.: From spaces to places: emerging contexts in mobile
   privacy. In: Proceedings of the 11th Intnl. conference on Ubiquitous computing, pp.
   1-10. ACM, (2009)
17. Mancini, C., Rogers, Y., Thomas, K., Joinson, A.N., Price, B.A., Bandara, A.K.,
   Jedrzejczyk, L., Nuseibeh, B.: In the Best Families: Tracking and Relationships. In:
   Proceedings of the 29th Intnl. Conference on Human Factors in Computing Systems,
   ACM CHI 2011. ACM Press, (2011)
18. Midata: access and control your personal data, www.bis.gov.uk/policies/consumer-issues/personal-data
19. OECD Guidelines on the Protection of Privacy and Transborder Flows of Personal
   Data, http://tinyurl.com/bgojzhu
20. O’Hara, K., Tuffield, M., Shadbolt, N.: Lifelogging: Privacy and empowerment with
   memories for life. Identity in the Information Society 1, 155-172 (2008)
21. Plaza, B.: Monitoring web traffic source effectiveness with Google Analytics: An
   experiment with time series. Aslib Proceedings 61, 474-482 (2009)
22. Plaza, B.: Google Analytics for measuring website performance. Tourism Manage-
   ment 32, 477-481 (2011)
23. Resource Description Framework (RDF), www.w3.org/RDF
24. Solove, D.J.: Understanding Privacy. Harvard University Press, London (2008)
25. The Data Liberation Front, www.dataliberation.org
26. Turner, S.J.: Website Statistics 2.0: Using Google Analytics to Measure Library
   Website Effectiveness. Technical Services Quarterly 27, 261-278 (2010)