Publishing L2TAP Logs to Facilitate Transparency and Accountability

Reza Samavi, University of Toronto, samavi@mie.utoronto.ca
Mariano P. Consens, University of Toronto, consens@mie.utoronto.ca

ABSTRACT
We propose publishing L2TAP privacy logs to facilitate privacy auditing tasks that involve multiple auditors, an increasingly common requirement in the context of social computing and big data driven science. Our proposal utilizes two ontologies, L2TAP and SCIP, designed for deployment in a Linked Data environment. L2TAP provides provenance-enabled logging of events. SCIP synthesizes contextual integrity concepts to express key privacy-related semantics associated with log events. We describe SPARQL query-based solutions for privacy log construction, obligation derivation, and compliance checking. The solutions facilitate accountability and transparency among participants (privacy auditors in particular).

[Figure 1: diagram of the data provider (PhysioNet) with the MIMIC II dataset and its auditor, research teams RT1 and RT2, each with its own L2TAP audit log and auditor, and an external auditor.]
Figure 1: Privacy Auditing Scenario

1. INTRODUCTION
The protection of individuals' privacy is becoming increasingly challenging in the era of social computing and data driven science. While privacy protection has implications in many application areas, it is particularly challenging when health-related data is involved. Big data enabled biological and biomedical research involves massive datasets of human genome, biological imaging, and clinical information collected and aggregated from individual health records. Protecting data subjects' privacy in clinical research is a concern addressed by multiple laws and regulations. For example, the U.S. Department of Health and Human Services (HHS) [27] obliges investigators to protect the privacy of data subjects and to maintain the confidentiality of data. HHS also requires investigators to establish oversight mechanisms and monitoring plans for research projects involving human subjects, and to remain accountable to the subjects' privacy rights. Auditing is essential to the enforcement of accountability, and many scenarios involve auditors from multiple institutions that monitor the fulfillment of privacy obligations.

To illustrate the need for an audit mechanism that facilitates accountability and transparency among multiple participants, consider a research study that analyzes the primary reasons for intensive care unit (ICU) hospitalization, examining the effectiveness of different types of medications across patient demographics (sex, age, and ethnicity). The scenario is depicted in Fig. 1. The dataset used for the study is MIMIC II, a clinical database provided by PhysioNet [10], where records contain information about the ICU admission of patients [16]. Although the MIMIC II database is a de-identified public dataset, access is available only under the terms of a data use agreement (DUA; see footnote 1). This DUA defines a list of obligations that a researcher agrees to fulfill. Some of these obligations involve purposes and roles. For example, the dataset should be used only for academic research purposes and by a researcher (ob1). Others are pre-obligations, which are actions that need to be performed prior to access. For instance, the DUA states that a researcher should complete a training program in human research subjects protections prior to access (ob2). There are also post-obligations, which are actions that need to be performed after access has been granted, such as: if the researcher finds information within restricted data that she believes might permit identification of any individual, she will report the location of this information promptly by email (ob3).

Two research teams (RT1 and RT2) are collaborating on this study.
The teams could be based at related or unrelated research institutions (e.g., RT1 is based at a hospital while RT2 is based at a university department, and the hospital may or may not be part of the university). The privacy policies mentioned above govern access to the MIMIC II dataset. The policies are designed by PhysioNet according to HIPAA [26] and other privacy regulations in order to protect the privacy of individuals whose data are used in research studies. In order to check whether the research teams comply with these policies, multiple potential auditors must be able to audit the log of access and fulfilment of obligations. One of the potential auditors is the data provider itself, which oversees the contract to ensure that the usage of data is in accordance with the agreement. The auditors from the two teams may also want to audit the process with respect to their internal data protection policies. In addition, an external auditor should be able to ensure that research involving individuals' health data is fully HIPAA compliant. The challenge that needs to be addressed is that, while multiple participants are involved in generating privacy logs (e.g. the data provider, the research teams, and the researchers), all potential auditors should be able to check the log to see whether the researchers are respecting the privacy of data subjects.

In the past few years, we have observed multiple practical proposals focusing on the privacy of big datasets and linked data ([22, 18, 7, 8]). These studies have concentrated on access control frameworks that define who can access which resources. They achieve privacy by safeguarding data before access is granted and provide no solutions for privacy support after the access. There is also solid theoretical and practical work to support privacy auditing (data usage control after access is granted). However, this work either exploits complex logic (e.g. [2, 9, 3, 5]) that jeopardizes its practical benefits, or uses system level logging standards (e.g. [15, 6]) to generate privacy audit logs on an application by application basis, thus generating privacy logs that cannot be exploited by multiple participants and auditors in a heterogeneous environment.

In [23] we proposed the L2TAP ontology (Linked Data Log to Transparency, Accountability and Privacy), which allows participants to log in RDF [28] the provenance assertions of privacy related events. We also proposed a second pluggable ontology, SCIP (Simple Contextual Integrity Privacy), to capture the privacy semantics of log events and enable SPARQL query-based implementations of auditing and compliance checking in a personalized health workflow (see footnote 2). The scalability of the framework for compliance checking has been evaluated by a set of queries described in [23].

Using L2TAP+SCIP, this paper proposes a standard way of privacy auditing in the big data research context. We propose multiple SPARQL query-based solutions (with limited RDFS reasoning support) to facilitate the tasks of constructing L2TAP privacy logs (when privacy policies are applicable to classes of individuals and data items), deriving obligations from privacy policies, and compliance checking. If research teams agree on the semantics of publishing the log based on L2TAP+SCIP, they can show their compliance to the auditors in a single effort, and auditors can oversee the compliance of the parties involved without additional effort. In other words, after the log has been created and all obligations and their fulfillments are captured, the research team can check the log and provide it as evidence of accountability [21]. Using the same log and the SPARQL solutions, all other auditors, including the data provider's auditor, the institution auditors (described in [26] as the Institutional Review Board (IRB) for single-site research and the Data and Safety Monitoring Board (DSMB) for multi-site research), and the external auditor, can check compliance.

The paper structure and contributions are as follows. Section 2 provides an overview of L2TAP and SCIP and shows how the ontologies can be used to capture the log events and their privacy semantics. Section 3 describes our SPARQL query-based solutions for constructing the log, obligation derivation, and compliance checking. Section 4 describes the related research. We conclude in Section 5.

Footnote 1: http://physionet.org/works/mimic2cdb/access.shtml
Footnote 2: L2TAP and SCIP are documented at http://l2tap.org.
Copyright is held by the author/owner(s). LDOW2014, April 8, 2014, Seoul, Korea.

2. L2TAP LINKED DATA LOG
In this section we first motivate the need for two ontologies to generate privacy audit logs. Then, using our motivating scenario, we describe L2TAP, an ontology for specifying the header of privacy log events, and SCIP, an ontology that provides the specifications needed to encode the privacy semantics of the body of log events.

The goal of L2TAP is to provide a set of classes and properties that can be used to represent and publish a log of privacy events as Linked Data. In the motivating scenario, expressing access policies and obligations, requesting access to the dataset, and fulfilling obligations are some of the typical privacy events that we expect L2TAP to be able to capture.

L2TAP follows the principles of Linked Data [11] to publish logs. Everything in an l2tap:Log is expressed in terms of l2tap:LogEvents. URIs are used as names for logs, log events, participants, and processes. Thus, the log events can be published as web dereferenceable URIs by participants. Participants who want to dereference a published log are authenticated and communicate over a secure https channel. After we describe the L2TAP and SCIP ontologies, at the end of Subsection 2.2 we justify why these ontologies rely on dereferenceable URIs and how the Linked Data infrastructure achieves log data integration when multiple parties contribute their privacy events to the log at different points in time.

The L2TAP ontology describes the header of a log event and is scoped to answer provenance queries about log events, such as who has contributed an event to the log and when. The when in L2TAP can be expressed as simple xsd time using two L2TAP properties, l2tap:eventTimestamp and l2tap:publicationTimestamp. There is some subtlety in capturing the who in L2TAP. Following the second principle of Linked Data, using http:// URIs as names for participants amounts to a data publisher choosing part of an http:// namespace that the publisher controls, by virtue of owning the domain name [11]. In L2TAP, the publisher of the log events is the logger, who owns the domain of the log and can talk about the events and their assertions (e.g. https://logRT.org). If an L2TAP logger wishes to identify a participant as the who in an event header, the logger must register the participant, i.e. mint the URI of the participant within the namespace of its domain. Registered participants will be considered accountable for the assertions that they make in the log.

The privacy semantics of privacy events (e.g. what is an obligation fulfilment) are contained in the body of a log event and expressed using the SCIP ontology. In designing SCIP,
we are inspired by the contextual integrity (CI) perspective [20]. The SCIP ontology provides mapping targets for the basic notions of participants in an information flow, privacy contexts, and privacy norms as described in CI. The goal of the SCIP ontology is to define a minimum set of classes, properties and constraints that allows basic compliance queries (e.g. which access request is non-compliant?) to be answered using SPARQL queries.

Having two namespaces is the basis for the flexibility and extensibility of the framework. The proposed SCIP ontology is just one instance of a class of pluggable ontologies for expressing privacy semantics, and it can be substituted with an ontology of more or less expressive power without impacting the semantics of the log header.

2.1 Log Event Types
L2TAP specifies three types of log events: log initialization events, participant registration events, and privacy events.

Log Initialization Events. This type of event defines which l2tap:Log is being initialized using the l2tap:initializesLog property. It also records assertions on the log characteristics, such as who the logger is (l2tap:logger) and how the event timestamps are captured (l2tap:logClock). Fig. 2 provides an example of using the L2TAP ontology to encode a log event that initializes a log with the URI https://logRT.org (line 1). This privacy log has a logger with the URI https://RT.org/logger (line 4). The logger is a foaf:Agent (line 7; see footnote 3). The physical timeline for this log is a constant in the timeline ontology (line 9; see footnote 4). Lines 10 and 11 encode the log's reference clock and the syncing frequency.

     1  <https://logRT.org> a l2tap:Log.
     2  <…> a l2tap:LogInitializationEvent;
     3      l2tap:initializesLog <https://logRT.org>;
     4      l2tap:logger <https://RT.org/logger>;
     5      l2tap:publicationTimestamp "2014-01-26T12:00:00Z"^^xsd:dateTime;
     6      l2tap:timeline <…>.
     7  <https://RT.org/logger> a foaf:Agent.
     8  <…> a l2tap:Timeline;
     9      l2tap:physicalTimeline tl:universaltimeline;
    10      l2tap:clock "wwp.greenwichmeantime.com/"^^xsd:string;
    11      l2tap:clockSyncFreq [tl:duration "P7D"^^xsd:duration].

Figure 2: Log initialization event

Footnote 3: http://xmlns.com/foaf/spec/
Footnote 4: http://purl.org/NET/c4dm/timeline.owl#

Participant Registration Events. This event type is used to register a foaf:Agent as an L2TAP log participant who can then submit future log events. The l2tap:registersAgent property links a log event to a foaf:Agent who will be recognized by the logger as the registered agent. As described above, the participant registration event marks the time instant at which a participant's URI was minted in the logger's domain. Registered participants are therefore held accountable for the log events that they contribute to the log afterwards. In our scenario, the research teams as receivers of data and PhysioNet as the data provider are participants in the log. If the data were not anonymized, each individual data subject or a class of data subjects could also have been registered as participants.

Fig. 3 shows an example of using the L2TAP ontology to register the class of research teams as a foaf:Agent, identified by the URI in line 5. Note that this is the URI of a class of researchers. In lines 8-9 the l2tap:Participant class and the l2tap:registeredAgent property are used to capture the fact that the URI of the class of researchers is minted in the logger's domain. A participant registration event may optionally use the l2tap:participantData property to add a named graph [4] as the event data (payload) of the event. Suppose that in our motivating scenario RT1 and RT2 are two classes of researchers. RT1 uses the dataset to study patients under 18 and RT2 studies patients 18 years and older. The optional named graph can be used to capture this classification and additional information about the members of each class. For example, we used the named graph (lines 11-15) to encode the research team hierarchy and memberships. Accountability can therefore be cascaded down to a specific individual.

     1  <https://logRT.org/logevent/e2> a l2tap:ParticipantRegistrationEvent;
     2      l2tap:memberOf <…>;
     3      l2tap:eventTimestamp "2014-01-27T12:00:00Z"^^xsd:dateTime;
     4      l2tap:publicationTimestamp "2014-01-27T12:00:01Z"^^xsd:dateTime;
     5      l2tap:registersAgent <…>;
     6      l2tap:eventParticipant <…>.
     7      l2tap:eventData <…>.
     8  <…> a l2tap:Participant;
     9      l2tap:registeredAgent <…>.
    10  <…> a foaf:Agent.
    11  <…> = {
    12      <…> a <…>.
    13      <…> a <…>.
    14      <…> a <…>.
    15      <…> a <…>.}

Figure 3: Participant registration event

Privacy Events. A privacy event is used to encode privacy processes such as expressing privacy policies, access requests, and obligation fulfilment. Fig. 4 shows how the L2TAP ontology is used to log provenance assertions of the privacy policies applicable to our scenario. The quads in this privacy event are grouped in two sets. The quads in lines 1-6 are the header of the event and the quads from line 8 onward are the body of the event. The data provider (PhysioNet) is the one who submits the quads of the policies to the log (line 3). The body of this log event (wrapped in the named graph https://logRT.org/logng/ng1) describes privacy policies and preferences as the payload of the event. The SCIP ontology is used to express the semantics of a privacy event's body.

2.2 Log Event Privacy Semantics
The L2TAP ontology described so far encodes a privacy event and its accountable participant regardless of the privacy semantics of the event. The SCIP ontology provides the vocabulary needed to capture the privacy semantics. We categorize the semantics in four groups: privacy preferences (policies), access requests and responses, obligation fulfillments, and access activities.

Privacy Preferences. The scip:PrivacyPreference class is used to encode a context and the norms applicable to the context.
The context in SCIP is characterized using multiple classes: scip:DataItem, scip:Purpose, and scip:PrivacyPrivilege. Use, collection, and disclosure are different types of privacy privileges. Participants in a context interact with each other in certain capacities or roles. In SCIP, the roles of the three main participants in an information flow are encoded using the scip:dataSubjectRole, scip:dataRequestorRole, and scip:dataSenderRole properties. The scip:Role class is used to capture abstract and concrete roles. In SCIP, roles, purposes, data items and privacy privileges are represented as lattices using rdfs:subClassOf.

     1  <…> a l2tap:PrivacyEvent;
     2      l2tap:memberOf <…>;
     3      l2tap:eventParticipant <…>;
     4      l2tap:eventTimestamp "2014-01-28T12:00:00Z"^^xsd:dateTime;
     5      l2tap:publicationTimestamp "2014-01-28T12:01:00Z"^^xsd:dateTime;
     6      l2tap:eventData <https://logRT.org/logng/ng1>.
     7  <https://logRT.org/logng/ng1> = {
     8      <…> a scip:PrivacyPreference; ...}

Figure 4: The header of a privacy event (for policies)

     9      scip:expressedBy <…>;
    10      scip:hasValidity [time:hasBeginning "2014-01-01T00:00:00Z";
    11                        time:hasEnd "2015-01-01T00:00:00Z"];
    12      scip:dataItem <…>;
    13      scip:requestorRole <…>;
    14      scip:purpose <…>;
    15      scip:privacyPrivilege <…>;
    16      scip:obligation <…>;
    17      scip:obligation <…>;
    18      scip:propositionalExpression <…>.
    19  <…> a scip:ObligationTemplate;
    20      scip:performAction <…>;
    21      scip:occurrenceGap "-1"^^xsd:integer;
    22      scip:performanceDuration "1"^^xsd:integer.}

Figure 5: The body of a privacy event (for policies)

Fig. 5 shows how the SCIP ontology is used to encode the obligations described in our scenario. Note that the quads in this figure are the continuation of the quads in Fig. 4. Line 9 uses the scip:expressedBy property to describe by whom the privacy preferences are expressed. Note that the minted PhysioNet URI is the participant who submits the privacy policies. The quads in lines 10 and 11 describe the validity time interval of the policies. The first obligation in our scenario (ob1) is expressed as the legitimate purpose (line 14) for using the dataset (line 12), the acceptable roles of participants (line 13), and the privilege that will be granted if the obligations are fulfilled (line 15).

There are also norms associated with a context that describe obligations, i.e. actions that need to be performed before (pre-obligation) or after (post-obligation) the dataset is accessed [19]. The scip:ObligationTemplate is a subclass of scip:Obligation that captures these actions. Obligations expressed in privacy preferences are templates for the future instantiation of executable obligations. The scip:Obligation is rdfs:subClassOf scip:ObligationTemplate. Obligations have properties to express the temporal constraints associated with an obligation. For example, the second obligation requires taking the training course (obtain_training_certificate) prior to access. This obligation is encoded using scip:performAction (line 20). The scip:occurrenceGap property in line 21 encodes the relative time interval for performing the obligation (a positive integer indicates occurrence after the access activity, and a negative integer occurrence before it). The scip:performanceDuration property in line 22 encodes the time required to perform the obligation. The third obligation is a post-obligation: it must be fulfilled after access has been granted, when a record is deemed to be identifiable.

Access Requests and Responses. The scip:AccessRequest class is used to encode a request by a researcher to access a dataset. A number of classes that we used to express privacy policies (such as scip:DataItem, scip:Role, scip:Purpose, and scip:DataRequestor) are also used to express access requests. An access request can be initiated by a class of participants or by an individual participant. In our motivating scenario, we assume that one of the researchers in the team (Mark) uses the framework to log his access request (cf. Fig. 6). Note that the who in the header of this log event is the URI of Mark (line 3), who is a member of the clinical researcher class. Line 8 encodes Mark's access request as an instance of scip:AccessRequest. The scip:dataRequestor property in line 9 captures the URI of the data requestor (Mark), scip:dataSender (line 10) captures who should send the data (PhysioNet), and scip:dataSubject (line 11) captures whose data has been requested (the Patients class in the MIMIC II dataset). As with the privacy policies, we encode in line 12 the URI of the requested data items (MEDITEMS: the class of all medications taken by patients), the purpose for accessing the data (line 13), and the roles of the participants requesting access (line 14). The privacy privilege that has been requested is encoded by scip:requestedPrivilege in line 15.

     1  <…> a l2tap:PrivacyEvent;
     2      l2tap:memberOf <…>;
     3      l2tap:eventParticipant <…>;
     4      l2tap:eventTimestamp "2014-01-29T12:00:00Z"^^xsd:dateTime;
     5      l2tap:publicationTimestamp "2014-01-29T12:01:00Z"^^xsd:dateTime;
     6      l2tap:eventData <…>.
     7  <…> = {
     8      <…> a scip:AccessRequest;
     9      scip:dataRequestor <…>;
    10      scip:dataSender <…>;
    11      scip:dataSubject <…>;
    12      scip:dataItem <…>;
    13      scip:purpose <…>;
    14      scip:requestorRole <…>;
    15      scip:requestedPrivilege <…>.}

Figure 6: Log event for access request

     1  <…> a l2tap:PrivacyEvent;
     2      l2tap:memberOf <…>;
     3      l2tap:eventParticipant <https://logRT.org/participants/PN_ACLAgent>;
     4      l2tap:eventTimestamp "2014-01-30T19:01:00Z"^^xsd:dateTime;
     5      l2tap:publicationTimestamp "2014-01-30T19:01:01Z"^^xsd:dateTime;
     6      l2tap:eventData <…>.
     7  <…> = {
     8      <…> a scip:AccessResponse;
     9      scip:responseTo <…>;
    10      scip:accessDecision "true"^^xsd:boolean;
    11      scip:contextObligation <…>;
    12      scip:contextObligation <…>;
    13      scip:propositionalExpression <…>.
    14  <…> a scip:Obligation;
    15      scip:createdFrom <…>.
    16  <…> a scip:Obligation;
    17      scip:createdFrom <…>.}

Figure 7: Log event for access response

     1  <…> a l2tap:PrivacyEvent;
     2      l2tap:memberOf <…>;
     3      l2tap:eventParticipant <…>;
     4      l2tap:eventTimestamp "2014-01-31T12:01:00Z"^^xsd:dateTime;
     5      l2tap:publicationTimestamp "2014-01-31T12:01:01Z"^^xsd:dateTime;
     6      l2tap:eventData <…>.
     7  <…> = {
     8      <…> a scip:ObligationAcceptance;
     9      scip:accepts <…>.}

Figure 8: Log event for obligation acceptance

     1  <…> a l2tap:PrivacyEvent;
     2      l2tap:memberOf <…>;
     3      l2tap:eventParticipant <…>;
     4      l2tap:eventTimestamp "2014-02-01T12:01:00Z"^^xsd:dateTime;
     5      l2tap:publicationTimestamp "2014-02-01T12:01:01Z"^^xsd:dateTime;
     6      l2tap:eventData <…>.
     7  <…> = {
     8      <…> a scip:PerformedObligation;
     9      scip:performedFor <…>;
    10      scip:performedBy <…>;
    11      scip:occurredIn "2014-02-01T11:00:00Z"^^xsd:dateTime .}

Figure 9: Log event for performing an obligation

The scip:AccessResponse class encodes the boolean response to an access request as well as the applicable obligations. The log event shown in Fig. 7 records the access response to Mark's request by dereferencing the corresponding access request URI (line 9). Line 10 encodes the access decision. Associated with each access response there can be a set of applicable obligations. The quads in lines 11-17 encode obligations derived from the privacy policies applicable to the study. Line 11 in this listing refers to the URI of the corresponding obligation using scip:contextObligation. When multiple obligations arise from an access request, a propositional formula ϕ describes how the satisfaction of these obligations relates to the overall compliance of the access request. In our example scenario ϕ ≡ ob1 ∧ ob2 ∧ ob3, i.e. all three obligations must be fulfilled for the access to be compliant. The scip:propositionalExpression property in line 13 encodes this formula. The rest of the quads in Fig. 7 link each of the performable obligations to the corresponding obligation template in the privacy policy.
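The temporal fields attached to an obligation template in Fig. 5 (scip:occurrenceGap and scip:performanceDuration) can be illustrated with a small Python sketch. This is only one plausible reading, not the ontology's definition: it assumes that all times are plain integers in an unspecified unit, that the window for performing an obligation opens at the access time plus the gap, and that it stays open for the performance duration. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ObligationTemplate:
    """Temporal fields of an obligation template, mirroring Fig. 5 (names are hypothetical)."""
    action: str                # value of scip:performAction
    occurrence_gap: int        # value of scip:occurrenceGap (negative: before access)
    performance_duration: int  # value of scip:performanceDuration


def is_pre_obligation(template: ObligationTemplate) -> bool:
    """A negative gap places the obligation before the access activity."""
    return template.occurrence_gap < 0


def performance_window(template: ObligationTemplate, access_time: int) -> Tuple[int, int]:
    """Return (earliest, latest) time for performing the obligation.

    Assumed reading: the window opens occurrence_gap units away from the
    access time and stays open for performance_duration units.
    """
    opens = access_time + template.occurrence_gap
    closes = opens + template.performance_duration
    return (min(opens, closes), max(opens, closes))


def performed_in_window(template: ObligationTemplate, access_time: int, performed_at: int) -> bool:
    """Check whether a logged performance time falls inside the window."""
    earliest, latest = performance_window(template, access_time)
    return earliest <= performed_at <= latest


if __name__ == "__main__":
    # Values taken from lines 21-22 of Fig. 5; the access time 10 is made up.
    training = ObligationTemplate("obtain_training_certificate", -1, 1)
    print(is_pre_obligation(training), performance_window(training, 10))
```

Under these assumptions, the training obligation with gap -1 and duration 1 has to be performed in the single time unit that ends at the access itself.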
Through these links, the characteristics of each obligation, such as its action and the temporal constraints associated with it, become resolvable. The who in the access response event of Fig. 7 (line 3) is https://logRT.org/participants/PN_ACLAgent, indicating that the participant who logged the response is an ACL agent of PhysioNet that implements access control and obligation derivation. These mechanisms are usually domain-dependent. In [23] we described how obligations can be derived from privacy preferences using SPARQL queries. Obligations can also be derived using more or less complex mechanisms. From the logging perspective, what is necessary is to have a mechanism in place to log the access decisions and obligations, regardless of which mechanism is used to control access or derive obligations.

When the obligations are derived from the privacy preferences (obligation templates) and logged, the obligation performer, who can be the same participant as the data requestor (Mark, the researcher) or a different participant, must fulfill the obligation in an acceptable time interval and log its fulfillment. SCIP has a number of properties to capture the participant who should perform an obligation (scip:obligationPerformer), the participant who actually performs the obligation (scip:performedBy), the one who can witness the violation of an obligation (scip:obligationWitness), and the one who actually witnesses it (scip:attestsViolation).

Obligation Acceptance. When the access response has been logged, the research team (as the obligation performer) accepts to perform the obligations. This event captures the researcher's commitment as a performative act. A performative act is the utterance of a self-describing act, which is performed by declaring that one is doing it [1]. Fig. 8 shows the log event for obligation acceptance. The event's participant (line 3) is Mark, one of the registered researchers. This event refers to the URI of the access response (line 9). By virtue of logging this event, the researcher not only acknowledges the existence of the obligations but also, as a performative act, commits himself to performing the obligations as conditions of access.

Performing Obligation. In the scenario, one of the research team members (Mark) is the participant who must perform the obligations as conditions of access to the dataset. The pre-obligation ob2 (obtain_training_certificate) means that the research team must obtain the certificate and log this action as evidence prior to access. Fig. 9 shows a log event that captures the fact that Mark has performed this obligation. Line 8 defines the performed obligation as an instance of the scip:PerformedObligation class. Line 9 refers to the URI of the corresponding obligation logged in the access response. The participant who performed the obligation and the time instant at which it was performed are encoded using scip:performedBy (line 10) and scip:occurredIn (line 11) respectively. Note that Mark is the who that submitted these quads to the log (line 3).

Access activity. Finally, SCIP has a class scip:AccessActivity to record the occurrence of an access activity. Fig. 10 shows the log event of an access activity when the research team (including all its members) has accessed the dataset. Line 9 refers to the URI of the corresponding obligation acceptance event using scip:forObligationAcceptance. Line 10 captures the time instant at which the access activity occurred. The provenance assertions for this log event (line 3) show that the researcher is the participant who logs the access activities. We assume that the data provider (PhysioNet) also has a mechanism in place to log all accesses to its dataset. Therefore, if the researcher fails to log an access activity, the discrepancy between the provider's access log and the L2TAP audit log will trigger a non-compliance incident.

     1  <…> a l2tap:PrivacyEvent;
     2      l2tap:memberOf <…>;
     3      l2tap:eventParticipant <…>;
     4      l2tap:eventTimestamp "2014-02-02T00:01:00Z"^^xsd:dateTime;
     5      l2tap:publicationTimestamp "2014-02-02T00:01:01Z"^^xsd:dateTime;
     6      l2tap:eventData <…>.
     7  <…> = {
     8      <…> a scip:AccessActivity;
     9      scip:forObligationAcceptance <…>;
    10      scip:occurredIn "2014-02-02T00:00:01Z"^^xsd:dateTime .}

Figure 10: Log event for access activity

The justification for leveraging the Linked Data infrastructure and dereferenceable URIs becomes evident as we walk through the log events for the motivating scenario described above. We summarize the registration of the log events in Fig. 11. Participants make statements about the events in the log. Therefore, they need to access the event data to dereference the URIs of past events that may have been logged by other participants at different points in time. For example, the privacy policies are registered by PhysioNet on Jan 28, 2014 (Fig. 4), and the access request is then logged by the research team on Jan 29, 2014. The access response event is logged by the PhysioNet access control agent on Jan 30, 2014, referring to the URI of the access request (scip:responseTo <…>). The access response also refers to the URIs of obligations registered by PhysioNet as part of the privacy policies (e.g. scip:createdFrom <…>). Analogously, the log events encoding the acceptance of obligations, the fulfilment of an obligation by one of the researchers, and the access activity logged by the research team refer to the URIs of other past log events. The statements that each of these participants wants to make depend on the URIs of statements that have been previously logged.

[Figure 11: sequence diagram among Research Team RT1, the L2TAP Audit Log, and Data Provider PhysioNet, showing the registration of Privacy Policies (Fig. 4 & 5), Access Request (Fig. 6), Access Response (Fig. 7), Obligation Acceptance (Fig. 8), Performed Obligation (Fig. 9), and Access Activity (Fig. 10).]
Figure 11: Registering log events into the log and required URI dereferencing

The events in Fig. 11 do not necessarily occur in the sequence shown. Consider a scenario in which the researcher logs an access to the dataset referencing an obligation acceptance's URI. However, the researcher happens not to log an obligation fulfilment event corresponding to the access response. The access response's URI would then not be referred to by a performed obligation event, and in turn the corresponding access request would also not be referred to. This results in a non-compliant access request, and the researcher becomes accountable for not logging the obligation fulfilment event. Therefore, the L2TAP+SCIP ontology relies on URI dereferencing to make the actions of each participant transparent to the other (authenticated) participants involved in the process and to provide support for accountability and privacy.

3. QUERY-BASED AUDITING
The fundamental reason for leveraging RDFS and Linked Data to generate L2TAP logs is to facilitate privacy audit tasks by queries over the created logs. In this section we first discuss how standard RDFS and the computation of transitive closures for the rdfs:subClassOf relationship can be exploited to support query-based audit tasks. Then we describe three major audit tasks (constructing the log with data usage policies, obligation derivation and fulfilment, and compliance checking) that can all be supported by SPARQL queries with limited RDFS reasoning support. These tasks involve several classes of participants, including the data provider, the data receivers (research teams), and auditors.

RDFS Reasoning Support. By leveraging Linked Data for privacy audit logs we achieve a flexible way to deal with the granularity of data items, the granularity of participants, and the applicable privacy policies. An individual's personal information can span from a very specific data item (e.g. the glucose level in a blood test) to a very general data item (e.g. the personal health record of an individual). Privacy policies and regulations (e.g. HIPAA) are applicable not only to an entire dataset but may also apply to a specific class of data items (e.g. mental health data) or a specific class of individuals (e.g. children under the age of 12). Individuals (e.g. the data subjects in our scenario) may have options to express personal privacy preferences applicable to the instances of their data. Expressing everything in the log (including data items, participants, etc.) using dereferenceable URIs provides the most flexible and generic way of representing the resources involved in privacy processes. Furthermore, the RDF representation of audit logs using the L2TAP and SCIP ontologies allows both the URI of a class of resources and the URI of an instance of a resource (participants or data items) to be dereferenced and reasoned about using RDFS. As shown in Fig. 3, members of the class of researchers are defined using a named graph as a log event payload. Exploiting rdfs:subClassOf makes it possible to reason about the entire class of researchers or a specific individual in the class when evaluating an obligation derivation query or a compliance query as described below. By the same token, applicable privacy policies and preferences can be determined for a class of data subjects, a class of data items, or one instance of these classes.

Log Construction. In our motivating scenario, the research institute is the one that needs access to the datasets for its researchers and also wants to hold its researchers accountable with respect to the dataset usage policy. Therefore, the research institute initializes the log and registers the participants. The institute can then use the log in the future to show interested auditors that its researchers are compliant with the policies. On the other hand, the data provider wants to be able to express the norms and policies that govern the data usage. The provider therefore wants to contribute these policies to the log and record all accesses to its datasets. We illustrated in Figs. 2-10 the set of quads that need to be stored in an L2TAP log for these tasks. All quads in these figures can be appended to an L2TAP log using SPARQL 1.1 [29] commands in three steps: first, a named graph is created for a log event using CREATE GRAPH <…>; second, the quads of the log header are inserted into the default graph of the log; and third, the quads of the log event body are inserted into the named graph using INSERT DATA {GRAPH <…> { … }}.

Obligation Derivation. After the log is constructed, the research teams (or an individual researcher) want to be able to derive the obligations applicable to the class of data items or data subjects that they want to access. This task can be accomplished through the computation of transitive closures for the rdfs:subClassOf relationship. Norms in the SCIP ontology are defined in terms of data items, the roles of participants who want to use the data items, the purpose of usage, and the requested access privilege. All these concepts are expressed in SCIP by a lattice using rdfs:subClassOf. For example, children under 12 are rdfs:subClassOf data subjects. Therefore, a SPARQL query with RDFS reasoning support can match the context of a set of privacy policies with the context of an access request. The query conditions check that all instances of data items, data subjects, roles, and privacy privileges asked for by the research teams in the access request graph are subsumed by the corresponding items in the privacy policies graph. The output of the query is then the set of obligations applicable to that access request. The method is described in more detail in our earlier publication ([23], Section 3).

Compliance Checking. An important audit task is to identify, at any given point in time, whether an access request is in compliance with the applicable privacy policies. Compliance of an access request is decided based on the status of its corresponding obligations. Therefore, a typical compliance

     1  ASK
     2  WHERE {
     3    ?obAcc scip:accepts ?response.
     4    ?response scip:responseTo ?request.
     5    ?response scip:contextObligation @ob.
     6    @ob rdf:type scip:Obligation.
     7    @ob scip:occurrenceGap ?occGap.
     8    @ob scip:performanceDuration ?pD.
     9    OPTIONAL {?accessActivity scip:forObligationAcceptance ?obAcc}.
    10    OPTIONAL {?accessActivity scip:accessedTime ?accessTime}.
    11    OPTIONAL {?performedOb scip:performedFor @ob}.
    12    OPTIONAL {?performedOb scip:performedBy ?performAgent}.
    13    OPTIONAL {?performedOb scip:occurredIn ?obligationTime}.
    14    OPTIONAL {?witness scip:attestsViolation @ob}.
    15    FILTER (((!bound(?performAgent) && !bound(?accessTime))
    16      || (bound(?accessTime) && (xsd:integer(@currentTime) <=
    17        fn:max((xsd:integer(?accessTime) + xsd:integer(?occGap) + xsd:integer(?pD)),
    18               (xsd:integer(?accessTime) + xsd:integer(?occGap)))))) &&
    19      (!bound(?witness))) }

Figure 12: Evaluating the fulfilment of an individual obligation

    Data: Access request: rq, currentTime: t
    Result: Boolean Compliance value for rq
    1  OB ← set of derived obligations for rq;
    2  φ ← propositional formula for rq;
    3  foreach ob_i ∈ OB do
    4      o_i ← answer of (SPARQL ASK obligation query (Fig. 12));
    5      Substitute ob_i in φ with o_i;
    6  end
    7  Substitute φ in Compliance Ask Query;
    8  C ← answer of (SPARQL ASK compliance query (Fig. 13));
    9  return (C)

Algorithm 1: An algorithm for compliance checking

able that will be used in the expression in line 4 to evaluate the access request compliance queries.
The FILTER statement checking task will be performed in three steps as illustrated in is the conjunction of ϕ and ?accessDecision meaning that if Algorithm 1. First multiple SPARQL ASK queries evaluate the access decision logged by the access control mechanism the status of all individual obligation and return true for an is false even if all obligations are fulfilled the access request obligation if it is fulfilled and false otherwise. The template would be non-compliant. query shown in Fig. 12 can be used for evaluating the fulfil- ment of an obligation after the parameter @ob is substituted with the URI of an obligation. A similar template query can 1 ASK 2 WHERE { ?response scip:responseTo @rq . be used to evaluate a pending obligation (an obligation that 3 ?response scip:accessDecision ?accessDecision . the conditions for its fulfillment not yet settled). 4 FILTER (@phi && xsd:boolean(?accessDecision)) } For each access response a propositional formula will be also logged indicating how the fulfilment of an individual Figure 13: Evaluating an access request compliance obligation contributes to the overall compliance of an access request. In our scenario the formula is ϕ ≡ ob1 ∧ ob2 ∧ ob3 i.e. A number of other compliance queries (e.g. which obligation all three obligations must be fulfilled for the access request is pending or which access request is not compliant at time t), to be compliant. The second step in the algorithm is to the experimental validation of the scalability of our solution, substitute the propositional variable in ϕ with the truth- and the practical benefits of our approach are described in values representing the state of every derived obligation. [23]. Each obi in this formula will be substituted with oi which can be true or false depending on the evaluation of the query 4. RELATED WORK in Fig. 12. Our research study is inspired by the concept of information accountability as described by Witzner et al. 
[30], that is The third step in the algorithm is to substitute ϕ as a propo- ensuring whether the policies and configured preferences that sitional variable and evaluate the template query in Fig. 13 govern the flow of personal information, are respected by to check the overall compliance of the corresponding access the parties that collect, use, and share users’ data. In an request. Note that in line 3 of the query in Fig. 13, we early work on the management of policies and the seman- include the graph encoding the access decision of the access tic web [14], Kolovski et al. emphasize on the need for a request. The ?accessDecision variable is a propositional vari- declarative access policies to support scalable information sharing among parties. The authors then propose a rule- privacy concerns in the emerging domains of linked data based discretionary access control language for the web. In applications, Speiser et al. [24] propose a privacy framework [13], Kagal et al. propose Rein, a policy framework grounded for policy specification and access control enforcement.While in semantic web technologies. The authors acknowledge and access control is a necessary mechanism to protect individu- respect the diversity and heterogeneity of policy languages als’ privacy, it is not sufficient to express and control data on the web and propose Rein as an ontological framework usage policies. The work introduced in this paper addresses for policy interoperability. The ontology proposed in this privacy concepts such as usage purposes and obligations after paper supports information accountability via privacy audit access. logs and complements the Rein proposal [13] by providing a SPARQL query based solutions for the basic compliance 5. CONCLUSIONS checking queries. While compliance auditing is mandated in different privacy legislation (e.g. [26, 21]), it has received less attention from There are solid theoretical foundations for policy auditing the research community. 
In this paper we continued our work over logs [2, 9, 3, 5]. Barth et al. use Alternating-time in [23] and showed that regardless of what logic is used to Temporal Logic to build a logical privacy model and design express privacy policies there is a standard way for privacy a privacy language (LPU) to express norms [2]. The concept logging that allows basic privacy events to be logged and of norms in this work has been adapted from the Contex- provides a scalable query-based solution for answering com- tual Integrity perspective [20]. The LPU language allows all pliance queries. We also demonstrated that L2TAP Linked communications between agents to be recorded in a logical Data Log is capable of facilitating basic privacy auditing trace. Norms are expressed as logical constraints and privacy tasks such as: constructing the log, obligation derivation, compliance is related to the logical concepts of satisfiability and compliance checking in the big data and linked data and entailment. Datta et al. [9] extended the LPU language research context. In our approach, the convenience of Linked with reasoning about information accountability over incom- Data and RDFS has been sought for privacy log interoperabil- plete logs. Basin et al. use metric first order temporal logic ity and facilitating accountability and transparency among (MFOTL) to express policies, which are then monitored to participants. verify whether the trace of actions satisfies desired temporal properties [3]. Cederquist et al. describe a framework that 6. ACKNOWLEDGMENTS uses audit logs to enforce compliance with discretionary ac- Financial supports from the NSERC Canada and Privacy cess control policies [5]. 
While this body of work propose Awards from IBM and the Information and Privacy Com- highly expressive privacy logic, lack of support by an scalable missioner of Ontario are greatly acknowledged.We thank the semantic technology prevents the approaches to be applied anonymous reviewers for their comments. outside of research labs. An important related work is the recently proposed RDF 7. REFERENCES provenance model (PROV-DM) [17]. The focus of PROV-DM [1] J. Austin. How to do things with words, volume 88. is on providing a domain independent ontology for asserting Harvard University Press, 1975. provenance of a resource on the web. While the provenance [2] A. Barth, A. Datta, J. C. Mitchell, and H. Nissenbaum. assertions of the L2TAP+SCIP log events (log event header) Privacy and contextual integrity: Framework and can be expressed using PROV-DM ontology, the ontology applications. In Proc. SP, pages 184–198, 2006. cannot support the structure needed to encode the semantics [3] D. Basin, F. Klaedtke, and S. Müller. Policy of the body of privacy events (e.g. privacy preferences, obli- monitoring in first-order temporal logic. In Proc. CAV, gations, and purpose of usage). A simple mapping between pages 1–18, 2010. L2TAP and PROV-DM allows a log event (regardless of its [4] J. Carroll, C. Bizer, P. Hayes, and P. Stickler. Named content) to be expressed by the PROV-DM ontology. The graphs. Web Semantics: Science, Services and Agents mapping requires adding a prov:Activity (i.e. defining a URI on the World Wide Web, 3(4), 2011. for the act of generating the l2tap:LogEvent as a prov:Entity). [5] J. Cederquist, R. Corin, M. Dekker, S. Etalle, J. den Then the assertion of the who, l2tap:eventParticipant, will Hartog, and G. Lenzini. Audit-based compliance be mapped to the prov:wasAssociatedWith property. The two control. Int. J. of Info. Security, 6:133–151, 2007. L2TAP properties capturing the when assertions are mapped [6] A. Chuvakin, E. Fitzgerald, R. Marty, R. 
Gula, to prov:startedAtTime and prov:endedAtTime respectively. W. Heinbockel, and R. McQuaid. Common event expression, 2008. In recent years, we have seen several proposals addressing [7] L. Costabello, S. Villata, N. Delaforge, F. Gandon, privacy in the Linked Data context ([22, 18, 7, 8]). This body et al. Linked data access goes mobile: Context-aware of research are mainly proposing access control frameworks authorization for graph stores. In LDOW- WWW, 2012. based on access control lists (ACLs). Authors in [22] propose [8] L. Costabello, S. Villata, O. R. Rocha, and F. Gandon. a privacy preferences vocabulary that can be utilized to ex- Access control for http operations on linked data. In press fine-grained access policies in Linked Data environment. ESWC, pages 185–199, 2013. Muhleisen et al. propose an access control mechanism for [9] A. Datta, J. Blocki, N. Christin, H. DeYoung, D. Garg, social web applications [18]. This framework uses SWRL to L. Jia, D. Kaynar, and A. Sinha. Understanding and express access rules. Authors in [12, 7, 8] leverage the Linked protecting privacy: formal semantics and principled Data architecture for providing authorizations and access audit mechanisms. In Proc. ICISS, pages 1–27, 2011. restrictions at the document level [12]. The authorization [10] A. L. Goldberger, L. A. Amaral, L. Glass, J. M. mechanism in [12] is based on WebID [25]. To address the Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley. Information accountability. Commun. ACM, Physiobank, physiotoolkit, and physionet components 51(6):82–87, 2008. of a new research resource for complex physiologic signals. Circulation, 101(23):e215–e220, 2000. [11] T. Heath and C. Bizer. Linked data: Evolving the web into a global data space. Synthesis Lectures on the Semantic Web: Theory and Tech., 1(1):1–136, 2011. [12] J. Hollenbach, J. Presbrey, and T. Berners-Lee. 
Using RDF metadata to enable access control on the social Semantic Web. In Proc. CCMLSK WS at CK, 2009. [13] L. Kagal, T. Berners-Lee, D. Connolly, and D. Weitzner. Using semantic web technologies for policy management on the web. In Proc of the National Conference on Artificial Intelligence, 2006. [14] V. Kolovski, Y. Katz, J. Hendler, D. Weitzner, and T. Berners-Lee. Towards a policy-aware web. In Semantic Web and Policy Workshop at the ISWC, 2005. [15] S. Loosemore, R. Stallman, R. McGrath, A. Oram, and U. Drepper. The GNU C library reference manual. Free software foundation, 2001. [16] G. B. Moody and L. Lehman. Predicting acute hypotensive episodes: The 10th annual physionet/computers in cardiology challenge. In Computers in Cardiology, pages 541–544. IEEE, 2009. [17] L. Moreau and P. Missier. PROV-DM: The PROV data model. W3C Recomm., W3C, June 2012. [18] H. Mühleisen, M. Kost, and J.-C. Freytag. SWRL-based Access Policies for Linked Data. In Proc. SPOT Workshop at SSW, 2010. [19] Q. Ni, E. Bertino, and J. Lobo. An obligation model bridging access control policies and privacy policies. In Proc. SACMAT, pages 133–142, 2008. [20] H. Nissenbaum. Privacy in Context: Technology, Policy, and the Integrity of Social Life. Stanford Law Books, 2009. [21] Official Journal of the EC. EU directive 95/46/EC on the protection of individuals rights with regard to the processing of personal data, 1995. [22] O. Sacco and A. Passant. A privacy preference ontology (PPO) for Linked Data. In Proc. LDOW, WWW, 2011. [23] R. Samavi and M. P. Consens. L2TAP+SCIP: An audit-based privacy framework leveraging Linked Data. In CollaborateCom (TrustCol), pages 719–726, 2012. [24] S. Speiser. Policy of composition? composition of policies. In Proc. POLICY, pages 121 –124, 2011. [25] H. Story, B. Harbulot, I. Jacobi, and M. Jones. FOAF+SSL: RESTful Authentication for the Social Web. In Proc. SPOT, 2009. [26] US Congress. 
Health Insurance Portability and Accountability Act of 1996, Privacy Rule. 45 CFR 164, Aug. 2002. [27] US Department of Health and Human Services. Code of Federal Regulations, Title 45 - Part 46 - Protection of Human Subject, Revised January 15, 2009. [28] W3C. RDF Vocabulary Description Language 1.0: RDF Schema. W3C, April 2002. [29] W3C. SPARQL 1.1 Query Language, W3C Proposed Recommendation. W3C, November 2012. [30] D. J. Weitzner, H. Abelson, T. Berners-Lee, J. Feigenbaum, J. Hendler, and G. J. Sussman.
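
As an illustration of the three-step compliance check of Algorithm 1 in Section 3, the following Python fragment is a minimal sketch that mimics the logic outside a SPARQL engine. It is not the implementation evaluated in [23]. The names ObligationRecord, obligation_fulfilled, and check_compliance are hypothetical, the time values are arbitrary integers, and each obligation is assumed to yield a single query solution, with None standing for an unbound SPARQL variable. The function obligation_fulfilled is a direct reading of the FILTER expression of Fig. 12 as printed, and check_compliance forms the conjunction of ϕ and the access decision as in Fig. 13.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ObligationRecord:
    """Logged facts about one obligation (hypothetical stand-in for the
    variable bindings of the query in Fig. 12; None means 'unbound')."""
    occurrence_gap: int                   # ?occGap
    performance_duration: int             # ?pD
    access_time: Optional[int] = None     # ?accessTime
    perform_agent: Optional[str] = None   # ?performAgent
    witness: Optional[str] = None         # ?witness


def obligation_fulfilled(ob: ObligationRecord, current_time: int) -> bool:
    """Plain-Python reading of the FILTER expression in Fig. 12."""
    # First disjunct: neither a performing agent nor an access time is bound.
    not_triggered = ob.perform_agent is None and ob.access_time is None
    # Second disjunct: an access happened and the deadline has not passed.
    within_deadline = False
    if ob.access_time is not None:
        deadline = max(
            ob.access_time + ob.occurrence_gap + ob.performance_duration,
            ob.access_time + ob.occurrence_gap,
        )
        within_deadline = current_time <= deadline
    # Final conjunct: no witness attests a violation.
    return (not_triggered or within_deadline) and ob.witness is None


def check_compliance(
    obligations: Dict[str, ObligationRecord],
    formula: Callable[[Dict[str, bool]], bool],
    access_decision: bool,
    current_time: int,
) -> bool:
    """Sketch of Algorithm 1 for a single access request."""
    # Steps 1 and 2: evaluate every derived obligation and substitute the
    # resulting truth values for the propositional variables of the formula.
    truth = {name: obligation_fulfilled(ob, current_time)
             for name, ob in obligations.items()}
    # Step 3: conjunction of the formula and the logged access decision.
    return bool(formula(truth) and access_decision)


# Illustrative data only: three obligations combined by conjunction.
obligations = {
    "ob1": ObligationRecord(10, 5, access_time=100, perform_agent="rt1"),
    "ob2": ObligationRecord(10, 5, access_time=100),
    "ob3": ObligationRecord(0, 0),
}


def phi(v: Dict[str, bool]) -> bool:
    return v["ob1"] and v["ob2"] and v["ob3"]


print(check_compliance(obligations, phi, True, 110))
print(check_compliance(obligations, phi, True, 120))
```

With these illustrative values, the request is evaluated once before and once after the deadline of the first two obligations. Passing a false access decision makes the request non-compliant regardless of the obligations, which mirrors the role of ?accessDecision in Fig. 13.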