A User Experience Model for Privacy and Context Aware Over-the-Top (OTT) TV Recommendations

Valentino Servizi, Technical University of Denmark, Kgs. Lyngby, Denmark (valse@dtu.dk)
Sokol Kosta, Aalborg University Copenhagen, Copenhagen, Denmark (sok@cmi.aau.dk)
Allan Hammershøj, Mediathand, Copenhagen, Denmark (allan@mediathand.com)
Henning Olesen, Aalborg University Copenhagen, Copenhagen, Denmark (olesen@cmi.aau.dk)
ABSTRACT
Conventional recommender systems provide personalized recommendations by collecting and retaining user data, relying on a centralized architecture. Hence, user privacy is undermined by the volume of information required to support the personalized experience. In this work, we propose a User Experience model which allows the privacy of a user to be preserved by means of a decentralized architecture, enabling the Service Provider to offer recommendations without the need of storing individual user data. We advance the current state of the art by: i) Proposing a model of User Experience (UEx) suitable for Persona-based recommendations; ii) Presenting a UEx collection model which enhances the user's privacy towards the service provider while keeping the quality of her preference predictions; and iii) Assessing the existence of the Persona profiles, which are needed for generating and addressing the recommendations. We perform several experiments using a real-world complete dataset from a medium-sized service provider, composed of more than 14,000 unique users and 33,000 content titles collected over a period of two years. We show that our architecture, in combination with our UEx model, achieves the same or better results, compared to state-of-the-art systems, in terms of rating prediction accuracy, without sacrificing the user's privacy.

Copyright © CIKM 2018 for the individual papers by the papers' authors. Copyright © CIKM 2018 for the volume as a collection by its editors. This volume and its papers are published under the Creative Commons License Attribution 4.0 International (CC BY 4.0).

1   INTRODUCTION
In the early days, broadcast television was a one-to-many relationship. The signal traveled one-way from the content provider towards the consumer, and so did the contents. User privacy was at the highest possible level, since the user was anonymously and passively connected to the network.
   With the introduction of an Internet-based "return path" for voting or rating of programs, initially in Digital Video Broadcast set-top boxes, and later with Over-the-Top (OTT) TV [15], a completely different scenario has been set in terms of personalization and privacy. As IP-based services are taking over, almost any network today assigns an IP address to each node [15], and once the connection is established, the network allows bidirectional communication between each and every pair of nodes (e.g. between user and service provider).
   By removing the constraint of one-way communication and assigning a unique address to each node of the network, service providers can easily collect detailed information from each node to build user profiles, and the level of personalization of the service theoretically has no limit [22]. In the best case scenario, users are only in control of the data collected actively, while information collected passively, such as user preferences and consumption patterns, seems to be out of their control. OTT TV services are defined as providers of the content that is usually associated with traditional broadcast television, but delivered over the Internet. While this may sound suspiciously like IPTV (and they do share similar underlying technologies), it is in fact not the same thing: with IPTV, customers pay the Internet Service Providers (ISPs) for the service, whereas with OTT TV the ISPs simply provide access to the Internet and, thereby, to the desired OTT TV service (which is a different entity). Compared to broadcast TV, the common practice of OTT providers of gathering information about the content users consume raises the issue that personalization comes at the cost of privacy, causing concerns for users, who are exposed to over-disclosure of their personal data and viewing habits [26].
   Many countries are pushing towards the concepts of privacy by design and privacy by default, in particular in the European Union (EU), driven by the General Data Protection Regulation (GDPR), which strongly highlights the concepts of linkability and personal data [10].
   Inspired by these GDPR principles, and focusing on a solution that could ethically improve the "status quo", we present the following contributions:
(1) We design a mathematical User Experience Model that allows a Service Provider to collect User Experience in a privacy-aware system that keeps personal information under the user's control within
her domain and a sanitized database within the Service Provider's domain.
(2) We validate the User Experience Model, which is used for user classification, by computing a type of rating based on weighted predictions in a many-fold comparative experiment, testing various similarity metrics and prediction algorithms.
(3) We perform extensive experiments using a real-world full dataset from a medium-sized service provider, composed of more than 14,000 unique users and 33,000 content titles collected over a period of two years. We use the Root Mean Squared Error (RMSE) to compare the performance of our solution with state-of-the-art algorithms such as FunkSVD, the winner of the popular Netflix competition, ItemItem, and PearsMean [8]. We further apply the Lenskit tool [8], which includes the three algorithms, to our dataset as a benchmark for the algorithm we propose.
   Not only the Service Providers, but in particular the Content Providers (CP) have access to sensitive user data. The CP has access to the personal information of its paying customers; therefore, it can also link them to any k-anonymous dataset maintained by the partnering Service Provider (SP), which the CP could access, probably by contract. In order to avoid this linkability hazard, we will prove that our solution raises the anonymity shield on the user consumption at the SP level and hence also at the CP level.
   The results show that our User Experience Model achieves the same or better results, compared to state-of-the-art systems, while offering the potential for drastically increasing user privacy. It enables Service Providers to accurately classify users' tastes by storing only a fraction of the users' consumption data, thereby drastically reducing the linkability hazard. Moreover, we show how clustering techniques applied to the User Experience Model of an OTT TV user base lead to the description of distinct Persona profiles, defined as homogeneous groups of individuals whose consumption patterns and motivation can be represented as a set of statements derived from quantitative measures [6].
   The concept of Persona is necessary for the anonymous user classification according to Persona profiles. Service personalization can then be made using the Persona profile to which a user belongs when accessing the service.
   Our solution aims at assisting providers in fulfilling the obligations enforced by law. A privacy-enhanced design of the recommender system can provide a competitive advantage for providers in at least two aspects: i) Lower expected costs for implementing future system updates in order to comply with new regulatory restrictions; and ii) Lower risks for privacy litigations, such as the case settled by Netflix for $9 million in favor of its customers (Case No. 11-cv-00379, U.S. District Court for the Northern District of California).

2   RELATED WORK
In this section, we present the related work on privacy and context-aware Recommender Systems, providing also the technological and regulatory background of the privacy problem, which we define as follows:

        How can we enable OTT TV providers to drastically enhance their users' privacy by augmenting existing technologies, protocols, as well as recommender systems?

The intuition behind our solution is that a provider can satisfy the privacy of individual users by providing recommendations to groups of users with similar characteristics. The solution is based on the concepts of Persona and Locked Persona Profiles. Succeeding in solving this challenge is not only beneficial for the users, but also convenient for the service providers. We provide a solution by breaking the problem into the following research questions:
   • How to turn the passive collection of consumption data, identified by the ontology User <-> consumes <-> content, into a model of User Experience (UEx)?
   • How to collect consumption patterns while avoiding the linkability to the related personal records?
   • How to provide Persona profiles from such a model of User Experience?
   • Do Persona profiles exist in the OTT TV context?

2.1   Background
Digital Rights Management (DRM) technology has been specified, designed, and implemented in order to fight piracy. As such, satisfying DRM sets a minimum level of possible privacy for the user. Yang et al. [28] describe an approach resting on the allocation of an anonymous UserID for authentication, which allows anonymous access to DRM-protected contents. In [11], Intertrust supplies OTT TV Service Providers with a DRM service named Express Play (EP), which relies on an architecture that allows anonymous access to the licenses necessary to decrypt the contents delivered via a Content Delivery Network (CDN), by use of bearer token technology. This demonstrates that it is possible for a user to access DRM-protected content while keeping her anonymity.
   In order to provide a personalized recommendation to any anonymous user accessing DRM-protected contents, the design approach theorized by Cooper et al. [5] seems extremely relevant. According to [6], there are three questions that the Persona approach attempts to address. With a slight adaptation to the goal of this project, these can be stated as follows: i) What different sorts of people are using OTT TV services? ii) How do their needs and behaviors vary? iii) What ranges of behavior and contexts need to be explored? Adapting their example to the OTT TV service, we might identify: i) the frequency of access to the service; ii) whether the user likes or dislikes the consumed contents; iii) the motivation for accessing the service, e.g. information, education, or entertainment.
   In [16], Hussain et al. introduce the concept of Locked Persona Profiles (LPP), which can be "used to generate a set of unlinkable proofs of ownership" while accessing a service. Locked Persona Profiles are perfectly compatible with both the need of protecting users' privacy and the need to protect operators from misuse threats such as piracy. Furthermore, Locked Persona Profiles would allow the user to access the service without being linked, thus giving the individual control over her personal information with no chance for the Service Provider (SP) to access this information without the user's knowledge or consent [16].
   A Recommender System usually needs to collect some amount of personal information about the user in order to provide the recommendation. This includes attributes, preferences, contact lists, among others [23]. Context information is not part of the user profile, but may also need to be protected, e.g. the user's current
location. Kim Cameron's first and second laws of identity [20] emphasize user control and minimal disclosure, and the OAuth 2.0 framework [17] can help to put the user in control. OAuth 2.0 allows the user to grant restricted access to her personal information towards a Service Provider and defines a protocol for securing application access to her protected resources, such as identity attributes, through Application Programming Interfaces (APIs) [17].
   Several methods have been proposed in the past for reducing linkability – here defined as the possibility of discovering a relationship between a user's consumption records or between a user's consumption and her identity [16]. k-anonymity occurs when "every tuple in the microdata table released is indistinguishably related to no fewer than k respondents" [4]. This can be achieved by removing identifiable information from the dataset. Crowd Blending relies on storing records about a group of similar users as a unique entity [13]. Differential Privacy relies on storing similar consumptions of different users as the same perturbed record [12]. Zero-Knowledge relies on representing the consumption of the whole user population by storing only a sample [14]. This approach is particularly interesting for two reasons: it seems to be more effective than the other techniques mentioned, and the cluster analysis necessary to exploit possible Persona profiles on a Zero-Knowledge dataset can be carried out directly, because the sampling happens beforehand. Therefore, the stored data do not need further sampling.

2.2   User Experience
User Experience (UEx) is very important for the service providers. Considering the specificity of the OTT TV application field, the User Experience should be exploited to provide users with good recommendations about linear or Video on Demand (VoD) media contents. Therefore, the interest is not the User Experience of the OTT TV service itself, but rather the User Experience about the content available in the Electronic Program Guide (EPG).
   As such, service providers define several metrics related to User Experience:
Service Consumption. In order to have any experience with the media content, users need to consume the content [25].
Context of consumption. The context in which the interaction with the service happens contributes to the experience about the content consumed in that context [25].
Subjective Perceived Value. A positive or negative experience results from the contribution of multiple drivers [21].
Usage Cycles. The experience might be determined by cycles of interactions [27], and the next cycle might be influenced by the quality of the content itself, based for example on the personal experience [24] or on the experience of other individuals [27].
User Behavior Frequency. In [18], based on a dataset provided by the BBC, evidence has been presented that each user repeatedly accesses a small number of items compared to the total amount of available items.

2.3   Privacy enhanced recommender systems
According to the review presented in [9], privacy concerns are mainly related to: (1) an attacker who correlates obfuscated data about a user with data from other publicly-accessible databases in order to link users with the sensitive information, or (2) an attacker using partial information, obtained e.g. by colluding with some users in the network, to attempt to reverse engineer the entire dataset.
   The main solutions described in the literature to preserve privacy are the following. (1) "Privacy preserving approach based on peer to peer techniques using users' communities", with recommendations generated on the client side without involving the server [9]. (2) "Centralized recommender systems by adding uncertainty to the data by using a randomized perturbation" [9], where such a perturbation could be achieved using e.g. differential privacy [12]. (3) "Storing user's profiles on their own side and running the recommender system in distributed manner without relying on any server" [9]. (4) Hybrid recommender systems that use secure two-party protocols and public key infrastructure. (5) "Agent based middleware for private recommendations service" [9], presenting good performance in terms of Mean Average Error but also tuning issues in order to get good coverage on the largest part of the user population. However, most of these solutions focus on protecting users from external security attacks rather than reducing their exposure towards the service provider.

3   OUR USER EXPERIENCE MODEL
The first step towards achieving the architecture described in the previous section is to understand who is consuming what, when, where, and whether she liked what has been consumed (or not). Our dataset consists of more than 700,000 ZAP events, representing the fact that a user taps on a content, 14,000 unique users, and 33,000 content titles collected over a period of two years by a medium-size OTT TV provider.
   We define the User Experience Model (UEx) as a tensor:

    UEx : C_1^{m_1} ⊗ C_2^{m_2} ⊗ ··· ⊗ C_k^{m_k} ⊗ P^{m_p} ↦ IR^{m_1 + m_2 + ··· + m_k + m_p}    (1)

where each of the C_x^{m_x} represents a vectorial space of the x-th context in which the user zaps into media contents, and P^{m_p} represents the vectorial space of the media contents that she zaps; m_1, m_2, ···, m_k, m_p are the dimensions of each vectorial space.
   In a traditional recommender system users may provide ratings, and Latent Semantic Analysis (LSA) [3] using the Term Frequency Inverse Document Frequency (TFIDF) matrix can be applied to the contents' descriptions. Instead of ratings, we propose to use a similar approach for the TV broadcasting scenario, where the representations of contexts and TV program weights for each user act as "ratings" in the User Experience model, and we introduce the concepts of Zap Frequency Inverse Context Frequency (ZFICF) and Zap Frequency Inverse Program Frequency (ZFIPF). The difference is that while ratings are collected actively, our User Experience is computed from the passive collection of the zaps into contents, which define the components of the User Experience. Further details are given in the following.

3.1   The User Experience as a vector
Linearizing the tensor defined in Eq. (1), we obtain the User Experience matrix presented in Eq. (2). The matrix consists of the Zap Frequency Inverse Context Frequency, one slice for each context, and the Zap Frequency Inverse Program Frequency (last slice).
    ( UEx(1) )   ( ZFICF_{1,1,1}  ···  ZFICF_{1,1,m_1}  ···  ZFICF_{1,k,1}  ···  ZFICF_{1,k,m_k}  ZFIPF_{1,1}  ···  ZFIPF_{1,m_p} )
    ( UEx(2) )   ( ZFICF_{2,1,1}  ···  ZFICF_{2,1,m_1}  ···  ZFICF_{2,k,1}  ···  ZFICF_{2,k,m_k}  ZFIPF_{2,1}  ···  ZFIPF_{2,m_p} )
    (   ...  ) = (      ...              ...                   ...                  ...               ...              ...       )    (2)
    ( UEx(n) )   ( ZFICF_{n,1,1}  ···  ZFICF_{n,1,m_1}  ···  ZFICF_{n,k,1}  ···  ZFICF_{n,k,m_k}  ZFIPF_{n,1}  ···  ZFIPF_{n,m_p} )

Equation 2: Linearized representation of the tensor described in Eq. (1). It is composed of slices corresponding to each of the five features, i.e. the four contexts (i) Time of Day, (ii) Time of Week, (iii) Time of Month, and (iv) Time of Year, see Eq. (3), plus one (v) concerning the TV programs consumed by the user, see Eq. (4). Each row maps the i-th user experience UEx(i). Therefore, just by looking at the distances between the users represented in this way, the model allows measuring the similarity between their experiences of the TV programs consumed within the contexts.
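As a concrete illustration of the dimensions involved, using the context segmentation adopted later in Sec. 4.1.1 (an example configuration rather than a requirement of the model), each row of Eq. (2) is a vector

    UEx(i) ∈ IR^{4+3+2+4+m_p} = IR^{13+m_p},

i.e. 13 ZFICF components (4 Time of Day, 3 Time of Week, 2 Time of Month and 4 Time of Year segments) followed by one ZFIPF component per TV program available on the EPG.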

Each slice, except the last one, represents one of the k possible contexts of consumption, while the last slice represents the TV programs consumed. Within each slice, except the last one, each column represents one of the m_k segments of the k-th context, while each column of the last slice represents one of the TV titles (or programs).
   Each row represents the experience of the i-th user in the time span of her interest. However, it could also represent the experience of the i-th component of a cluster (Persona) in the time span of the Service Provider's interest. Therefore, the experience of each user can be collected as a vector having the same representation as one row of the matrix. This vector would be the summary of the User Experience and could be easily collected by the Service Provider and put under the user's control by using OAuth 2.0 on the zaps-to-contents. Besides, if the users mapped in this way are arranged in clusters, each cluster would represent homogeneous experiences. If such homogeneous groups exist, they allow for user classification as well as a "set of statements" derived from the measures, fitting the definition of a Persona.

3.2   User Experience Components
In this subsection we elaborate on the meaning and the nature of the User Experience components, and discuss the concepts of ZFICF and ZFIPF.

3.2.1 Zap Frequency Inverse Context Frequency. The Zap Frequency Inverse Context Frequency relates to the first m_1 + m_2 + ··· + m_k components of the User Experience tensor defined in Eq. (1). We define the Zap Frequency Inverse Context Frequency (ZFICF) for user i on the context segment j of context k as:

    ZFICF_{ijk} = (Z_{ijk} / TotZ_{ik}) · log(TotC_k / AC_{ik}),    (3)

where:
   • i indicates the i-th user, where i = 1, ..., n;
   • the k-th context can refer alternatively to Time of Day, Time of Week, Time of Month, Time of Year, or Device (D), i.e. k = 1, ..., 5;
   • j indicates the j-th segment of the k-th context, j = 1, ..., m_k. The k-th context is divided into a total number of segments equal to TotC_k = m_k. For example, Time of Day could be divided into 24 segments, thus m_TimeOfDay = 24, one per hour; but it could also be divided into 2 segments, day and night, in which case m_TimeOfDay = 2.
The first factor of Eq. (3) is the Zap Frequency (ZF) recorded for user i in segment j of context k, where Z_{ijk} represents the amount of zaps recorded during the time span of interest, and TotZ_{ik} = Σ_{j=1}^{m_k} Z_{ijk} is the total amount of zaps recorded for user i in context k during the (same) time span.
   The second factor is the Inverse Context Frequency (ICF) recorded for user i in context k, where TotC_k = m_k, as already mentioned, is the total number of segments composing context k, and AC_{ik} = card{ j | Z_{ijk} ≠ 0, ∀j ∈ [1, m_k] } indicates the number of segments within context k in which the i-th user zapped at least once.

3.2.2 Zap Frequency Inverse Program Frequency. The Zap Frequency Inverse Program Frequency (ZFIPF) relates to the last m_p components of the User Experience tensor defined in Eq. (1). It is similarly defined by the following equation:

    ZFIPF_{ij} = (Z_{ij} / TotZ_i) · log(TotP / AP_i),    (4)

where:
   • i denotes the i-th user, where i = 1, ..., n;
   • the j-th element indicates the j-th TV program, where j = 1, ..., m_p. The total amount of TV programs available on the EPG is TotP = m_p.
The first factor of Eq. (4) is the Zap Frequency (ZF) for the i-th user on the j-th program, where Z_{ij} represents the amount of zaps recorded during the time span of interest for user i on the TV program j, and TotZ_i = Σ_{j=1}^{m_p} Z_{ij} is the total amount of zaps of user i within the (same) time span.
   The second factor represents the Inverse Program Frequency (IPF_i) for user i in the time span of her interest, where TotP is the total amount of TV programs available on the EPG, and AP_i = card{ j | Z_{ij} ≠ 0, ∀j ∈ [1, m_p] } indicates the number of programs for which user i zapped at least once within the (same) time span.
   It is important to note that the ZFIPF over a large interval of time relies on the grouping of the items, especially in environments where, for example, TV series or news programs might recur continuously. If the grouping is not done properly, the measure will be affected by a large error.
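To make the two measures concrete, the sketch below computes ZFICF and ZFIPF from a table of zap events and concatenates them into the rows of Eq. (2). It is a minimal illustration, not the authors' implementation: the zap-log column names (user, program, one categorical column per context) and distinct segment labels across contexts are assumptions.

    import numpy as np
    import pandas as pd

    def zficf(zaps: pd.DataFrame, context: str, segments: list) -> pd.DataFrame:
        # Zaps per user and context segment: Z_ijk.
        counts = (zaps.groupby(["user", context]).size()
                      .unstack(fill_value=0)
                      .reindex(columns=segments, fill_value=0))
        zf = counts.div(counts.sum(axis=1), axis=0)               # Z_ijk / TotZ_ik
        icf = np.log(len(segments) / (counts > 0).sum(axis=1))    # log(TotC_k / AC_ik)
        return zf.mul(icf, axis=0)                                # Eq. (3)

    def zfipf(zaps: pd.DataFrame, programs: list) -> pd.DataFrame:
        # Zaps per user and TV program: Z_ij.
        counts = (zaps.groupby(["user", "program"]).size()
                      .unstack(fill_value=0)
                      .reindex(columns=programs, fill_value=0))
        zf = counts.div(counts.sum(axis=1), axis=0)               # Z_ij / TotZ_i
        ipf = np.log(len(programs) / (counts > 0).sum(axis=1))    # log(TotP / AP_i)
        return zf.mul(ipf, axis=0)                                # Eq. (4)

    def user_experience(zaps: pd.DataFrame, contexts: dict, programs: list) -> pd.DataFrame:
        # One row per user: the ZFICF slices for each context followed by the ZFIPF slice.
        slices = [zficf(zaps, ctx, segs) for ctx, segs in contexts.items()]
        slices.append(zfipf(zaps, programs))
        return pd.concat(slices, axis=1).fillna(0.0)

For example, contexts could be {"time_of_day": ["Morning", "Noon", "Evening", "Night"], ...}, matching the segmentation described in Sec. 4.1.1.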
4   EVALUATION
In this section, we provide several experiments that prove the existence of Persona profiles based on User Experience. They can be effectively computed using a Zero-Knowledge Consumption Data Base (DB), which we define as a collection of data sampled on-line from the user consumption feedback in such a way that only a
portion having size below 10% of the consumption data is retained and stored in the DB, in order to fulfill the definition of Zero-Knowledge provided in [14].
   As the results of these experiments, we show and validate that: i) User Experience is effective for classifying a user according to such Persona profiles; ii) Zap Frequency Inverse Program Frequency predictions computed by looking for similar users from a sample of the dataset are reasonably accurate. Finally, we provide an example of a Persona profile, obtained by clustering a sample of the dataset, according to the definition stated in the introduction.
   We use a real-world dataset, provided by a Media Company, which provides both Video On Demand and Linear TV on behalf of a Content Provider (the data are provided by Mediathand, a company based in Denmark, which operates the OTT TV service on behalf of Glenten). The data can be represented in a minimalistic and generic way by the tuple <contractId, contentId, deviceId, deviceType, zap_timestamp>, where: contractId is the ID assigned to the contract by the Content Provider, e.g. at the start of the subscription, in order to authenticate the user, referring to one or possibly more users active on the same contract; contentId is the ID assigned to the media content by the Content Provider, e.g. the TV Network; deviceId is the ID of the user's device authenticated while accessing the content, which is also used in combination with the contractId to identify a single user (userId); deviceType identifies e.g. Mobile, PC or TV; and zap_timestamp is the timestamp determining when the user switches the TV channel (it is an attribute of the ontology user - zaps into - content).
   At this stage, for the purpose of organizing users in homogeneous groups, it is necessary to select data about users consuming contents alone and to filter out data about users consuming contents together with other individuals, because while the first group represents the clean signal we want to analyze, the second is affected by noise. It seems reasonable to consider mobile and tablet devices as personal and to separate them from PCs and TVs, which are considered social devices; we distinguish these devices based on the screen size, assuming that the chance of a user watching some content together with others is directly proportional to the screen size. From the full dataset we therefore restrict the data to mobile devices only and keep only the information needed for the analysis. The resulting dataset is composed of 743,975 zaps (a zap being the event recorded when a user taps into a content), 14,518 users, and 33,357 content titles consumed over 2 years.
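A minimal sketch of this pre-processing step, assuming column names that match the tuple above (the actual schema and screen-size rule belong to the provider and are not shown here):

    import pandas as pd

    # One row per zap event: <contractId, contentId, deviceId, deviceType, zap_timestamp>.
    zaps = pd.read_csv("zaps.csv", parse_dates=["zap_timestamp"])

    # Mobile devices are treated as personal, PCs and TVs as social devices,
    # so only the former are kept for the cluster analysis.
    personal = zaps[zaps["deviceType"] == "Mobile"].copy()

    # A single user (userId) is identified by the combination contractId + deviceId.
    personal["userId"] = personal["contractId"].astype(str) + ":" + personal["deviceId"].astype(str)

    dataset = personal[["userId", "contentId", "zap_timestamp"]]
    print(len(dataset), dataset["userId"].nunique(), dataset["contentId"].nunique())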
4.1   Persona classification and rating prediction
The purpose of this experiment is to verify the consistency of the User Experience modeled in Eq. (1) by predicting 90% of the users' preferences while employing only 10% of them for the computation. Succeeding in such a challenging purpose will demonstrate that Service Providers could serve recommendations by maintaining only 10% of the data currently retained about the users' consumption. The experiment consists of two steps: i) Testing the consistency of our solution by predicting the Zap Frequency Inverse Program Frequency (ZFIPF) and measuring the RMSE; we use 90% of the dataset to find target users and 10% for computing the prediction. ii) Testing the consistency of the User Experience Model by applying three popular algorithms implemented within the latest version of Lenskit [8] to the ZFIPF computed on the same whole dataset.

4.1.1 Nearest Neighbor Mean, User Experience classification after SVD using Cosine and Chebyshev metrics (setup and execution). In order to compute the User Experience vector for each user from the zaps recorded in the dataset, and in particular to compute ZFICF_{ijk}, we have set up each context as follows: i) the Time of Day context is divided into 4 segments: Morning, Noon, Evening, Night; ii) the Time of Week context is divided into 3 segments: Week Days, Saturday and Sunday; iii) the Time of Month context is divided into 2 segments: Far From Pay Check and Close To Pay Check (the pay check is considered to arrive at the end of the month); iv) the Time of Year context is divided into 4 segments: Spring, Summer, Autumn, Winter.
   As for the Cold Start Problem related to Users [23], while generating the User Experience from the dataset the Minimum Zap Threshold (MZT) for collecting both the Model User Sample (MUS) and the Target User Sample (TUS) has been set to 25 zaps, after a number of preliminary experiments. As for the Cold Start Problem related to Contents [23], for collecting the Target Content Sample (TCS) we set the MZT to 1000 zaps. After applying Singular Value Decomposition (SVD) [3], for user classification we look for at most 4 Nearest Neighbors (KNN), comparing the results obtained by using the Chebyshev and Cosine metrics.
   The user is represented by her User Experience Vector, which is a row of the matrix in Eq. (2). Thus, we classify each user belonging to the Target Users Sample, represented as in Eq. (2), against each of the relevant users belonging to the Model Users Sample, also represented as in Eq. (2). Each random sample counts 10% of the population from which it has been extracted [29]. To make predictions, we target a relevant amount of users having ZFIPF ≠ 0 for at least one element of the Target Contents Sample (TCS). Therefore, when collecting the Target Users Sample we make sure to avoid any overlap with the Model Users Sample (MUS) adopted for computing the predictions. The experiment has been repeated 600 times (600-fold), and each time every sample (TCS, TUS, MUS) has been randomly collected again from the whole dataset, first splitting the MUS from the rest of the data.
   The user classification and the ZFIPF prediction have been accomplished by the following steps. i) Compute the User Experience vector for each user of both the Target Users Sample and the Model User Sample. ii) Store the Zap Frequency Inverse Program Frequency (ZFIPF) regarding each media content within the TCS from the User Experience Vector (UEV) of each user belonging to the Target Users Sample in a new array (NA), and then set to zero the ZFIPF within the originating UEV. iii) For each user belonging to the Target Users Sample, compute the K Nearest Neighbors belonging to the Model Users Sample having the relevant Zap Frequency Inverse Program Frequency not null, using all the mentioned metrics. iv) For each metric, compute a Zap Frequency Inverse Program Frequency prediction as the mean of the values available from the KNN. v) Compute the Root Mean Squared Error (RMSE) of each prediction against the NA. The random prediction has been generated using the range ZFIPF ∈ [0, max(NA)] instead of ZFIPF ∈ [0, ∞], therefore it is more accurate.
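One fold of this procedure could be sketched as follows. The sketch is illustrative only and simplifies the original setup (the rotation of the samples across 600 folds, the restriction of neighbors to users with non-null ZFIPF, and the SVD rank are assumptions):

    import numpy as np
    from numpy.linalg import svd
    from scipy.spatial.distance import cdist

    def predict_zfipf(uex_mus, uex_tus, target_cols, metric="chebyshev", k=4, rank=10):
        # Ground truth for the Target Users Sample, then hidden before classification.
        truth = uex_tus[:, target_cols].copy()
        masked = uex_tus.copy()
        masked[:, target_cols] = 0.0

        # Project both samples on the leading right singular vectors of the MUS.
        _, _, vt = svd(uex_mus, full_matrices=False)
        basis = vt[:rank].T
        mus_lat, tus_lat = uex_mus @ basis, masked @ basis

        # k nearest Model users per Target user under the chosen metric.
        dist = cdist(tus_lat, mus_lat, metric=metric)          # "chebyshev" or "cosine"
        neighbours = np.argsort(dist, axis=1)[:, :k]

        # Predicted ZFIPF = mean of the neighbours' ZFIPF on the target contents.
        pred = uex_mus[:, target_cols][neighbours].mean(axis=1)
        rmse = np.sqrt(np.mean((pred - truth) ** 2))
        return pred, rmse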
[Figure 1: boxplot of the RMSE (y-axis, 0 to 1.2) for PearsMean, ItemItem, FunkSVD, NN-MEAN SVDcos, NN-MEAN SVDcheb, and RANDOM.]

Figure 1: Boxplot of the RMSE resulting from (i) the Lenskit 5-fold experiment - PearsMean, ItemItem, FunkSVD - on the Zap Frequency Inverse Program Frequency calculated using the whole dataset as ZFIPF : TVPrograms^{10^4} ↦ IR^{~10^4}, and (ii) the 600-fold experiment (see Sec. 4.1.1) run using the whole dataset to predict the ZFIPF as the mean of the Target User's Nearest Neighbors (NN-MEAN) against a random sample having 10% of the dataset size; the neighbors are based on the Cosine or Chebyshev metric after SVD of the User Experience Model as UEx : TimeOfDay^4 ⊗ TimeOfWeek^2 ⊗ TimeOfMonth^2 ⊗ TimeOfYear^4 ⊗ TVPrograms^{10^4} ↦ IR^{~10^4}.


4.1.2 Lenskit Experiment. Lenskit is a very popular and effective tool developed by GroupLens at the University of Minnesota [8]. It encapsulates several recommender system algorithms, such as ItemItem, PearsMean and FunkSVD; in particular, these three have been taken into account for this experiment. The experiment was run on the ZFIPF computed using the whole dataset and provided in the same format as the MovieLens ratings. The setup of the experiment is fairly straightforward and involves the selection of the ratings domain, which we set to the range ZFIPF ∈ [0, max(NA)]. The precision has been set as the maximum resulting from the ZFIPF. Through some preliminary experiments we found the parameters ensuring the lowest RMSE; for example, the number of features for FunkSVD, which is set to 40 by default, performs very poorly, while 10 seems to achieve the best performance. Lenskit completes a 5-fold validation on the whole dataset. The RMSE results of the experiment are around 0.4 and are presented in Figure 1.
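The conversion of the ZFIPF matrix into that rating format could look like the following sketch (file names and the export step are assumptions; the Lenskit configuration itself is not shown):

    import pandas as pd

    # zfipf: users x programs matrix of ZFIPF values (see Eq. (4)).
    zfipf = pd.read_pickle("zfipf.pkl")

    # MovieLens-style ratings table: one (user, item, rating) triple per non-zero ZFIPF,
    # with the rating domain set to [0, max(NA)] as described above.
    ratings = (zfipf.rename_axis(index="user", columns="item")
                    .stack()
                    .rename("rating")
                    .reset_index())
    ratings = ratings[ratings["rating"] > 0]
    ratings.to_csv("zfipf_ratings.csv", index=False)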
4.1.3 Experiment Outcome (Figure 1). We aim at proving the consistency of the User Experience Model by comparing the RMSE resulting from the experiment run with Lenskit against the results harvested using our solution. From the cross-validation based on the same dataset, it is possible to notice as a first result that the model predictions outperform the random assignment and are aligned with the state-of-the-art algorithms provided by Lenskit. Moreover, from these results we can notice that the similarity metrics used for classification do not influence the RMSE; they all yield closely the same results. However, we chose Chebyshev because, compared e.g. to Cosine, it could identify clusters where users are closer to a normal distribution.

4.2   Persona profiles from a Zero-Knowledge DB
This experiment aims at finding distinct Persona profiles. It uses mostly the same setup as the previous experiments, i.e. a sample of users having 10% of the whole population's size, and should take into account the following challenges: i) Which clustering algorithm is the best fit for finding distinct groups of similar users? ii) How to choose the amount of clusters in order to avoid over-fitting?
   To choose between k-means [1] and hierarchical clustering [19], we perform a qualitative analysis of the Model Users Sample plotted after Singular Value Decomposition, choosing the first and the third main eigenvectors of the eigenbasis as latent dimensions (see Figure 2). From these results, we conclude that the second algorithm seems to be the right choice, since the Shape, Density, and Size of the potential clusters seem irregular [1]. About the estimation of the amount of clusters, in order to avoid over-fitting it is possible to rely on a selection of algorithms for cross-validation, such as Calinski-Harabasz (CalHar) [2] and Davies-Bouldin (DavBou) [7]. In particular, these two methods have been used to restrict the interval of probable amounts of clusters from [1, 60] to [2, 30], and the latter interval has been used for setting up the final experiments, carried out using Davies-Bouldin.
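A possible implementation of this cluster-count search with standard tooling is sketched below (library choice, linkage criterion, and the latent dimensionality are assumptions; the paper does not state its implementation):

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

    def estimate_clusters(latent, k_range=range(2, 31)):
        # 'latent' holds the Model Users Sample after SVD (leading eigenvectors).
        tree = linkage(latent, method="complete", metric="chebyshev")
        scores = {}
        for k in k_range:
            labels = fcluster(tree, t=k, criterion="maxclust")
            if len(np.unique(labels)) < 2:
                continue
            scores[k] = (davies_bouldin_score(latent, labels),      # lower is better
                         calinski_harabasz_score(latent, labels))   # higher is better
        best_k = min(scores, key=lambda k: scores[k][0])            # Davies-Bouldin choice
        return best_k, scores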
4.2.1 MUS clusters estimation. This is a starting point in the description of Persona profiles according to the definition provided in Sec. 2.1. Moreover, thanks to this experiment, we converge towards the most probable amount of clusters in order to provide the top list of contents for each cluster, as well as Time of Day, Time of Week, Time of Month and Time of Year, as features to characterize Persona profiles from the Model Users Sample, which here represents the Zero-Knowledge Consumption DB. The results of Algorithm 1 are presented in Table 1.

4.2.2 Experiment Conclusion. As we can notice from Figure 2, some clusters could be merged or discarded towards the best compromise. Perhaps a thorough cluster analysis would reduce the amount of relevant clusters to the minimum algorithmic estimation. Nevertheless, since the sampling applied to obtain the Model Users Sample simulates the Zero-Knowledge Consumption DB, irrelevant clusters might represent important seeds for new clusters, and new Persona profiles could grow from there. Therefore, we considered it relevant to keep clusters below 2% of the total population as seeds for potential new Persona profiles.
Table 1: Persona profiles. They are described as a set of statements, one for each of the features: Statement_{feature: TV-Programs}, Statement_{feature: ToD}, Statement_{feature: ToW}, Statement_{feature: ToM}, Statement_{feature: ToY}. The statement about Contents has been formulated by looking at the contents having the highest mean ZFIPF and the lowest StDev of the ZFIPF within the cluster.

Name      | Size        | Persona profile
Clust. 6  | 59%         | Addicted to News (Content). Mainly active during the Evening (ToD), mostly during the week, never on Saturday (ToW), and slightly more during Summer (ToY).
Clust. 2  | 17.6%       | News is first, then fitness (Content). Mainly active during the Evening, less at Night and Noon (ToD), mostly during the week, never on Saturday (ToW), slightly more far from Pay Check (ToM), Summer and Autumn (ToY).
Clust. 9  | 5.9%        | Passionate about Sport News, Talent Shows and Cars, interested in Crime, sometimes cartoons, perhaps for entertaining her children (Content). Mainly active during the Evening, never at Night (ToD), mostly during the week, sometimes in the weekend (ToW), far from Pay Check (ToM) and during Winter (ToY).
Clust. 1  | 4.5%        | Passionate about Cycling and interested in Football (Content). Mainly active during Morning and Evening, sometimes at Night (ToD), far from Pay Check (ToM) and Summer (ToY).
Clust. 7  | 4%          | Passionate about Cooking and Drama, interested in News and Crime (Content). Mainly active during Morning and Evening, never at Noon (ToD), mostly during the week, never on Saturday (ToW), far from Pay Check (ToM) and Autumn (ToY).
Clust. 8  | 2.25%       | Passionate about Fitness and interested in Sport News and News (Content). Mainly active at Noon, never at Night (ToD). Mostly active in Winter, never during Spring (ToY).
Clust. 3, 4, 5, 10, 12, 13 | < 1.8% each | Seed clusters.


  Algorithm 1: It samples from the dataset and it computes the                                                 with different interests even when close to each other, such as those
  User Experience matrix (see Eq. 2). Then, after dimensionality                                               preferring sports Vs. fitness.
  reduction using Singular Value Decomposition, it estimates the                                                iii) We demonstrate that the quality of prediction of a popular
  amount of clusters and it detects the clusters using hierarchical                                            Recommender Systems such as FunkSVD, where the user privacy
  clustering. Finally, for each cluster it returns the measures on                                             represents a huge issue, is comparable with our predictions, where
  the features necessary for providing the Persona profiles.                                                   we potentially can keep the privacy at ZK-level since we need only
function DetectPersonaProfiles
    Input : Dataset, MZT = 25, SamplingCoefficient = 0.1,
            MeanMin = 0.05, StDevMax = 1,
            SimilarityMetric = Chebyshev,
            ClustersEstimationMethod = DavBou,
            ClusteringTech = HierarchicalClustering
    Output: Persona(ClusterSize, TopContentsList,
            mean(ZFICF_ToD), mean(ZFICF_ToW),
            mean(ZFICF_ToM), mean(ZFICF_ToY))

    MUS = Sampling(MZT, SamplingCoefficient, Dataset)
    Zaps = CollectZaps(Dataset, MUS)
    ZFICF_ijk = zficf(Zaps)
    ZFIPF_ip = zfipf(Zaps)
    UserExperience = HorizontalConcatenate(ZFICF_ijk, ZFIPF_ip)
    [U, V, S] = SingularValueDecomposition(UserExperience)
    ClustersAmount = ClustersEstimationMethod(U, SimilarityMetric)
    Clusters = ClusteringTech(U, ClustersAmount, SimilarityMetric)
    for Cluster in Clusters do
        return ClusterSize = 100 * countElement(Cluster) / countElement(MUS),
               TopContentsList = {Contents | mean(ZFIPF_Content) > MeanMin
                                  and StDev(ZFIPF_Content) < StDevMax},
               mean(ZFICF_ToD), mean(ZFICF_ToW),
               mean(ZFICF_ToM), mean(ZFICF_ToY)
    end
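For illustration, a minimal Python sketch of this pipeline could look as follows. It assumes the ZFICF and ZFIPF feature matrices are already available as NumPy arrays (one row per user); the helper names, the average-linkage choice, the three retained singular components, and the use of scikit-learn's Davies-Bouldin score are assumptions of this sketch, not details taken from our experimental implementation.

    # Illustrative sketch of the DetectPersonaProfiles pipeline (not the exact
    # experimental code). zficf and zfipf are precomputed arrays, one row per user.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist
    from sklearn.metrics import davies_bouldin_score

    def detect_persona_profiles(zficf, zfipf, sampling_coefficient=0.1,
                                mean_min=0.05, stdev_max=1.0, max_clusters=25):
        # Model Users Sample (MUS): a uniform random sample of users.
        n_users = zficf.shape[0]
        mus = np.random.choice(n_users, size=int(sampling_coefficient * n_users),
                               replace=False)

        # User Experience matrix: horizontal concatenation of the two feature views.
        user_experience = np.hstack([zficf[mus], zfipf[mus]])

        # Dimensionality reduction via SVD; keep the three leading components.
        u, s, vt = np.linalg.svd(user_experience, full_matrices=False)
        embedding = u[:, :3]

        # Estimate the number of clusters with the Davies-Bouldin index,
        # clustering hierarchically with the Chebyshev distance.
        links = linkage(pdist(embedding, metric='chebyshev'), method='average')
        best_k, best_score = 2, np.inf
        for k in range(2, max_clusters + 1):
            labels = fcluster(links, t=k, criterion='maxclust')
            if len(set(labels)) < 2:      # guard against degenerate cuts
                continue
            score = davies_bouldin_score(embedding, labels)
            if score < best_score:
                best_k, best_score = k, score
        labels = fcluster(links, t=best_k, criterion='maxclust')

        # One Persona profile per cluster: relative size, stable above-threshold
        # "top contents", and the mean time-context (ZFICF) features.
        personas = []
        for c in range(1, best_k + 1):
            members = labels == c
            zfipf_c = zfipf[mus][members]
            top = np.where((zfipf_c.mean(axis=0) > mean_min) &
                           (zfipf_c.std(axis=0) < stdev_max))[0]
            personas.append({
                'cluster_size_pct': 100.0 * members.sum() / len(mus),
                'top_contents': top.tolist(),
                'mean_zficf': zficf[mus][members].mean(axis=0),
            })
        return personas

A production implementation would additionally apply the MZT zapping threshold when building the Model Users Sample and report the ToD/ToW/ToM/ToY means separately, as in the algorithm above.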
on the ontology represented by the dataset, it is possible to define a model of User Experience that can be collected and updated by the Service Provider in samples (e.g. the Model Users Sample) which would no longer be linkable to individual users. Therefore, it would be possible for the Service Provider to collect a privacy-aware Zero-Knowledge Consumption DB. In particular: i) We defined a mathematical representation of User Experience suitable for this application field, which has been implemented and tested thoroughly with dimensionality reduction and hierarchical clustering. ii) We showed that, in this application field, personalization based on Persona profiles is possible: homogeneous groups of users exist, and familiar clustering techniques can distinguish between them. iii) The prediction accuracy achieved by conventional Recommender Systems such as FunkSVD, where user privacy represents a major issue, is comparable with that of our predictions, while we can potentially keep privacy at ZK-level: only 10% of the user data needs to be constantly stored on the Service Provider side, which would make it extremely difficult to link any user to such a data set.
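As a purely hypothetical illustration of what the Service Provider would need to retain under this scheme, the record below contains only the cluster-level aggregates produced by the algorithm above; the field names are ours, and per-user viewing histories appear nowhere in it.

    # Hypothetical sketch of a Persona record kept in the Zero-Knowledge
    # Consumption DB: cluster-level aggregates only, no per-user histories.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class PersonaProfile:
        cluster_size_pct: float                                     # share of the Model Users Sample in this cluster
        top_contents: List[str] = field(default_factory=list)       # contents with mean(ZFIPF) > MeanMin and StDev(ZFIPF) < StDevMax
        mean_zficf_tod: List[float] = field(default_factory=list)   # mean ZFICF over the time-of-day context
        mean_zficf_tow: List[float] = field(default_factory=list)   # mean ZFICF over the time-of-week context
        mean_zficf_tom: List[float] = field(default_factory=list)   # mean ZFICF over the time-of-month context
        mean_zficf_toy: List[float] = field(default_factory=list)   # mean ZFICF over the time-of-year context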
5.1 Future Work
Further work should also be prioritized towards the following focus areas. i) Data sampling. Sampling is a critical component of both the Recommender System and the privacy-enhanced architecture; more sophisticated and effective options, such as the statistical data sampling presented by Google [29], could be investigated. ii) Recommender System. We will extend the current work by designing and implementing an effective Persona-based Recommender System architecture. Evaluating the recommender through "precision" and "recall" is a fairly straightforward application of the results presented in this project: the list of top programs of each user in a Target Users Sample can be compared with the list of top programs derived from each Persona profile detected within a MUS (a minimal evaluation sketch follows this subsection). However, the filtering criteria necessary to improve the recommendations remain a critical piece of work, whose progress should be measured looking, e.g., at diversity and/or serendipity. iii) Persona Management. The Persona life-cycle [16] in the User Domain differs from that in the Service Provider domain. For example, from the SP's perspective, a Persona may come into existence and cease to exist at any time; when it ceases to exist, it could also be resurrected to serve another user. A Persona may appear as a seed within the working User Experience sample, grow, mature, and become dominant in some context, and it could become strategic as a solution to cold-start problems concerning new users. From the user's perspective, alternative Personas could be used to access the same service in different contexts and always obtain the best recommendation, "Privacy and Ethically Enhanced".
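As a starting point for that evaluation, the following sketch (again only an illustration; the helper names, dictionary layout, and top-N cut-off are our assumptions) computes average precision and recall by comparing each target user's observed top programs with the TopContentsList of the Persona that user is matched to.

    # Illustrative precision/recall for Persona-based recommendations:
    # compare each target user's actual top programs with the top-contents
    # list of the Persona profile assigned to that user.
    from typing import Dict, List, Set, Tuple

    def precision_recall(recommended: Set[str], relevant: Set[str]) -> Tuple[float, float]:
        """Standard set-based precision and recall for a single user."""
        hits = len(recommended & relevant)
        precision = hits / len(recommended) if recommended else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    def evaluate(persona_top: Dict[int, List[str]],
                 user_persona: Dict[str, int],
                 user_top: Dict[str, List[str]],
                 n: int = 10) -> Tuple[float, float]:
        """Average precision/recall over a Target Users Sample.

        persona_top  : Persona id -> TopContentsList derived from the MUS
        user_persona : user id    -> Persona id the user is matched to
        user_top     : user id    -> the user's own top programs (ground truth)
        """
        precisions, recalls = [], []
        for user, persona in user_persona.items():
            recommended = set(persona_top[persona][:n])
            relevant = set(user_top[user])
            p, r = precision_recall(recommended, relevant)
            precisions.append(p)
            recalls.append(r)
        return sum(precisions) / len(precisions), sum(recalls) / len(recalls)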
ACKNOWLEDGMENTS
The authors would like to acknowledge the help provided by Hossain Ahmad and Naoufal Medouri.
[Figure 2: 3-D scatter of the sampled users along eigen-1, eigen-2 and eigen-3, coloured by cluster (Clust.1–Clust.13); plot omitted.]
Figure 2: The User Experience computed on a random sample whose size is 10% of the full dataset. After dimensionality reduction via Singular Value Decomposition, each user has been plotted according to the three main eigenvectors and assigned to a cluster obtained with hierarchical clustering and the Chebyshev metric.

REFERENCES
[1] David Arthur and Sergei Vassilvitskii. 2007. K-Means++: The Advantages of Careful Seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms (2007), 1027–1035. DOI:http://dx.doi.org/10.1145/1283383.1283494
[2] T. Caliński and J. Harabasz. 1974. A dendrite method for cluster analysis. Communications in Statistics - Theory and Methods 3, 1 (1974), 1–27. DOI:http://dx.doi.org/10.1080/03610917408548446
[3] Michal Campr and Karel Ježek. 2015. Comparing Semantic Models for Evaluating Automatic Document Summarization. Springer International Publishing, Cham, 252–260. DOI:http://dx.doi.org/10.1007/978-3-319-24033-6_29
[4] V. Ciriani, S. De Capitani di Vimercati, S. Foresti, and P. Samarati. 2007. K-Anonymity. Springer US. 36 pages. http://spdp.di.unimi.it/papers/k-Anonymity.pdf
[5] Alan Cooper. 2004. The Inmates Are Running the Asylum. Sams Publishing. 288 pages. DOI:http://dx.doi.org/10.1007/978-3-322-99786-9_1
[6] Alan Cooper, Robert Reimann, and David Cronin. 2007. About Face 3: The Essentials of Interaction Design. Vol. 3. Wiley Publishing. 610 pages. DOI:http://dx.doi.org/10.1057/palgrave.ivs.9500066
[7] D. L. Davies and D. W. Bouldin. 1979. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence 1, 2 (1979), 224–227. DOI:http://dx.doi.org/10.1109/TPAMI.1979.4766909
[8] Michael D. Ekstrand, Michael Ludwig, Joseph A. Konstan, and John T. Riedl. 2011. Rethinking the recommender research ecosystem. In Proceedings of the fifth ACM conference on Recommender systems - RecSys '11 (2011), 133. DOI:http://dx.doi.org/10.1145/2043932.2043958
[9] Ahmed M. Elmisery and Dmitri Botvich. 2011. An agent based middleware for privacy aware recommender systems in IPTV networks. Smart Innovation, Systems and Technologies 10 SIST (2011), 821–832. DOI:http://dx.doi.org/10.1007/978-3-642-22194-1_81
[10] European Union. 2016. Regulation 2016/679 of the European Parliament and the Council of the European Union. Official Journal of the European Union (2016). Retrieved 2016-12-17 from http://eur-lex.europa.eu/legal-content/en/TXT/?uri=CELEX%3A32016R0679
[11] ExpressPlay. 2017. Key Storage - ExpressPlay. (2017). Retrieved 2017-04-26 from https://www.expressplay.com/developer/key-storage/
[12] Arik Friedman, Shlomo Berkovsky, and Mohamed Ali Kaafar. 2016. A differential privacy framework for matrix factorization recommender systems. User Modeling and User-Adapted Interaction 26, 5 (2016), 425–458. DOI:http://dx.doi.org/10.1007/s11257-016-9177-7
[13] Johannes Gehrke, Michael Hay, Edward Lui, and Rafael Pass. 2012. Crowd-Blending Privacy. (2012), 479–496.
[14] Johannes Gehrke, Edward Lui, and Rafael Pass. 2011. Towards privacy for social networks: A zero-knowledge based definition of privacy. Lecture Notes in Computer Science 6597 LNCS (2011), 432–449. DOI:http://dx.doi.org/10.1007/978-3-642-19571-6_26
[15] Ilsa Godlovitch, Bas Kotterink, J. Scott Marcus, Pieter Nooren, Jop Esmeijer, and Arnold Roosendaal. 2015. Over-The-Top players (OTTs): Market dynamics and policy challenges. European Union. 137 pages. DOI:http://dx.doi.org/10.2861/706687
[16] M. Hussain and D. B. Skillicorn. 2011. Mitigating the linkability problem in anonymous reputation management. Journal of Internet Services and Applications 2, 1 (2011), 47–65. DOI:http://dx.doi.org/10.1007/s13174-011-0020-4
[17] Michael B. Jones and Dick Hardt. 2012. The OAuth 2.0 Authorization Framework. (2012), 1–76. Retrieved 2016-12-17 from https://tools.ietf.org/html/rfc6750
[18] Dmytro Karamshuk, Nishanth Sastry, Mustafa Al-Bassam, Andrew Secker, and Jigna Chandaria. 2016. Take-away TV: Recharging work commutes with predictive preloading of catch-up TV content. IEEE Journal on Selected Areas in Communications 34, 8 (2016), 2091–2101. DOI:http://dx.doi.org/10.1109/JSAC.2016.2577298
[19] Fionn Murtagh and Pedro Contreras. 2012. Algorithms for hierarchical clustering: An overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2, 1 (2012), 86–97. DOI:http://dx.doi.org/10.1002/widm.53
[20] Arun Nanda, Andre Durand, Bill Barnes, Carl Ellison, Caspar Bowden, Craig Burton, James Governor, Jamie Lewis, John Shewchuk, Luke Razzell, Marc Canter, Mark Wahl, Mike Jones, Phil Becker, Radovan Janocek, Ravi Pandya, Robert Scoble, and Scott C. Lem. 2005. The Laws of Identity. (2005), 13.
[21] Jakob Nielsen and Don Norman. 2015. The Definition of User Experience. (2015). Retrieved 2016-12-17 from http://www.nngroup.com/about-user-experience-definition
[22] Henning Olesen, Josef Noll, and Marlo Hoffman. 2009. User profiles, personalization and privacy. Outlook, Wireless World Research Forum 3 (2009), 1–38.
[23] Francesco Ricci, Lior Rokach, and Bracha Shapira. 2015. Recommender Systems Handbook. Springer. 1003 pages.
[24] Alistair Sutcliffe. 2009. Designing for User Engagement: Aesthetic and Attractive User Interfaces. Vol. 2. Morgan and Claypool. 1–55 pages. DOI:http://dx.doi.org/10.2200/S00210ED1V01Y200910HCI005
[25] David Sward and Gavin Macarthur. 2007. Making User Experience a Business Strategy. Towards a UX Manifesto 2 (2007), 35–42. DOI:http://dx.doi.org/10.1183/09031936.00022308
[26] The European Commission. 2011. Special Eurobarometer 359: Attitudes on Data Protection and Electronic Identity in the European Union. (2011), 330. Retrieved 2016-12-17 from http://ec.europa.eu/public_opinion/index_en.htm
[27] Julie R. Williamson and Stephen Brewster. 2012. A performative perspective on UX. Communications in Mobile Computing 1, 1 (2012), 3. DOI:http://dx.doi.org/10.1186/2192-1121-1-3
[28] Jen Ho Yang, Chih Cheng Hsueh, and Chung Hsuan Sun. 2010. An efficient and flexible authentication scheme with user anonymity for Digital Right Management. In Proceedings of the 4th International Conference on Genetic and Evolutionary Computing, ICGEC 2010 (2010), 630–633. DOI:http://dx.doi.org/10.1109/ICGEC.2010.161
[29] Celal Ziftci and Ben Greenberg. 2015. GTAC 2015: Statistical Data Sampling. (2015). Retrieved 2016-12-15 from https://www.youtube.com/watch?v=cXi1Jo5V7UM