Towards an Access Control Model for Knowledge Graphs (Discussion Paper) Marco Valzelli1 , Andrea Maurino1 , Matteo Palmonari1 and Blerina Spahiu1 1 Department of computer systems and communication, University of Milano - Bicocca, Milano, Italy Abstract Nowadays Knowledge Graphs are a common way to integrate and manage large amount of information. This involves also sensitive domains like security, so the management of access control on these graphs became crucial. Due to their dimension, Knowledge graphs are often stored using NoSQL solutions, that have very poor support to access control. In this paper a distributed and secure Knowledge graph management system is presented. The system supports both open and closed access control and its architecture guarantees the secure access of very large knowledge graph by means of query transformation. Keywords access control, knowledge graph, NoSQL, security 1. INTRODUCTION In 2012 Google introduced the Knowledge Graph (KG in the follow) as a new technology to improve its famous search engine. Basically, it is a graph that contains all the main named-entity of the world and the relations between them, so that the search engine is able to identify the subject of the query and suggest all object or features related to it. Since their introduction, KGs have been largely used by a lot of big and small companies (e.g. Amazon, Facebook and so on), but they are also used in several research domains as a way to describe in a common and shared way the knowledge of a given field. Several companies are now aggregating the knowledge stored in both structured and unstructured sources into a unique, comprehensive KG, for a better data management. It is worth noting that a KG can be considered as an evolution of the traditional data integration approach, and as a consequence the size of KG is bigger than previous solutions. As the uses case of KG increase, they start to include also very sensitive data like for cyberse- curity purposes [1], in the military sector [2] the law enforcement sector [3], but also civilian cases like the medical one [4, 5]. For example in this last case, the KG includes all information related to doctors, patients, diseases and so on. Some people may access the KG in order to analyze correlation among clients but they cannot access to personal details of patients, while SEBD 2021: The 29th Italian Symposium on Advanced Database Systems, September 5-9, 2021, Pizzo Calabro (VV), Italy Envelope-Open m.valzelli@campus.unimib.it (M. Valzelli); andrea.maurino@unimib.it (A. Maurino); matteo.palmonari@unimib.it (M. Palmonari); blerina.spahiu@unimib.it (B. Spahiu) © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) another person e.g. a doctor can access his patients’ information but s/he cannot see information about others doctors. So, the explosion of KG raises new issues wrt. the preservation of relevant information from unauthorized access and their growing request for scalability at a higher level. Currently, there is no common opinion about how to represent a KG. Among others, RDF is one of the most adopted description languages, SPARQL is the query language for RDF model and many systems for managing RDF data (also called triple store) are available. However there are proposals to implement an access control mechanism in RDF [6, 7, 8], but they are not scalable nor flexible and, most important not implemented in commercial triple store. Recently, others graph models like the Labelled Property Graph model (LPG from now on) has been used to represent knowledge graphs. In this extended abstract, we propose a solution for enabling access control of large KG in a scalable way. The scalability issue is solved by means of Janusgraph1 a distributed and scalable property graph. We provide a systematic translation of RDF concepts in terms of property graph model. Then we define an access control method based on logical description of KG and users in terms of property graph model; our solution support a subset of SPARQL query languages by optimizing Gremlinator [9] a software layer able to translate a SPARQL query into a property graph query. The paper is organized as follows: in section 2 we report the most important contribution of literature and industrial solution for the access control problem applied to graphs, while in section 3 we provide a more formal description of the problem. Section 4 presents the proposed access control method and in section 5 the implementation of the model is shown. Finally in section 6 we draw conclusion and discuss future works. 2. Related Work The great success of graphs as data model is very recent, but in the past we can find that trees, a particular type of graph, have been used as a data structure on which apply access control. With the need to store data comes along the need to provide an access control framework. An example is XML, a format widely used to exchange information across the web, on which has been proposed an access control model[10]. Starting from 2001 with the birth of the semantic web [11], the RDF semantic graph framework provided an efficient way to store big amounts of data in a graph format. RDF datasets usually are meant to be shared across the web, so no access control models have been provided. Since the users of RDF endpoints are often anonymous, the literature offers many proposals of context-based access control frameworks. One of these [12] also suggests an interesting mechanism for the creation of view on the graph based on the composition of allow/deny rules. The researches in this area also proposes to define policies through ontologies and to apply these policies using ontological reasoning. Research in this area also tried to address particular aspects of this technology like distribution [6, 13], federation [14] inference of new information [15] and possible solutions to the scalability problem [7]. Also more traditional access control models like DAC (Discretionary Accesss Control) have been implemented [8]. Although there are significant literature for access control in RDF Graphs [16], these approaches cannot be directly applied in other graph contexts as they focus on different 1 janusgraph.org aspects or heavily rely on ontological reasoning, that is usually provided only by native RDF stores. Available NoSQL tools were not designed with security and privacy criterion and still lack in this aspect [17, 18]. Just a small part of these solutions support access control features, and only for specific types of policies, like RBAC (Role-Based Access Control). Also notable is the case of Accumulo2 where access control at cells level is provided. Graph model is supported by several NoSQL products, but in a very different way from RDF systems. While the letters are based on triples, the former adopt the property graph model. Graph database can be divided into native, where data is physically stored in terms of nodes and edges, and the hybrid, that provides a property graph model but rely on a different storage model like the wide-column. Neo4J is the most common native graph database and it implements the LPG model. It still has only basic security and access control features: it allows to define user role and their privileges3 , but not at a fine-grained level. The only significant work in the property graph context [19] has designed an access control model defining how users or groups of users can access and manipulate data. However, it has some drawbacks like lack of flexibility and redundancy. In fact, you can only describe positive permissions and you have to define them for every node, so it’s not clear how the model can be applied with inheritance and conflicts. Furthermore, this work and the neo4j proposal has both the characteristic that they requires quite complex algorithm to be evaluated, so their scalability is in doubt. A different approach has been proposed in the Tinkerpop community4 where the evaluation mechanism of authorizations has been integrated in the Graph-DMBS but it checks for access control metadata in every edge and node, so it needs a certain redundancy both in the stored data and in the interrogation, impacting on the performance of the system. 3. Considered Use Cases In the past years the literature on access control, thanks to a great adoption in commercial DBMS, has defined a variety of policies one can implement[20]. We focus on three main kind of policies for access control, that are: DAC, MAC, and RBAC. The former provides for a direct rights specification between users and resources. The letter allows to assign rights to users based on their roles, while MAC provides access to the resources based on the clearance level of the user, following the need-to-know principle (closed policy): a subject should only be given access rights that are required to carry out the subject’s duties. These access control policies were designed to be mutually exclusive, but instead we want to combine and use them at the same time. We extend it by adding more abstraction also over resources, borrowing the notion of classes from RDF ontologies. In fact, groups of users and roles are ways to assign the same rights to more users at the same time, saving memory and maintainability time. We pursue the same objective over resources by giving the possibility to define access rights over entity classes and then propagate them using inheritance mechanisms. 2 https://accumulo.apache.org/ 3 neo4j.com/blog/role-based-access.-control-neo4j-enterprise/ 4 archive.fosdem.org/2019/schedule/event/graph_access_control_tinkerpop Figure 1: Graph pattern for the access control framework. ”Can” and ”Cannot” edges cannot be present simultaneously For example, let’s consider the case of a national security agency fighting terrorism. It KG includes classes like Person, Organization and Work. We call this extended graph of the domain Graph. It should be noted that we can use relations between classes as a way to extend access rights between instances that belongs to those classes. In fact, if is defined a relation between ”Criminal” and ”Document” classes and a user has access to a specific criminal profile, probably she will have access also to his related documents (except for clearance). It is possible to represent as a graph also users, that can be grouped by departments of the agency or their roles. We call this part of the graph the User’s graph. According to our scenario, we want to specify only what a user can view, adopting a closed policy. That can be done on the base of both her role and groups,as foreseen by RBAC policy. Probably, different departments of the organization investigate different areas of the world, so they have to be able to view only a small portion of the graph. In addition, we may want to cover also particular cases (like temporary operation) that allows access to specific documents, by means of DAC rules. Moreover, all those rules have to match a user’s clearance level (MAC). 4. Access Control Model The proposed access control model has the goal to design a security mechanism that is reusable for many different policies, especially both the cases of open and closed policies. Our approach is focused on giving or not access to resource. More specific rights like own, delete, modify can be built on the top of our model. The idea behind our model is to exploit all the flexibility of the LPG model so we do not give a rigid data structure, but only some graph patterns to follow, as can be seen in figure 1. Three are the main part of our model: the user’s subgraph, the resource’s subgraph and the authorization’s edges between them. In the user’s subgraph users are represented by nodes with properties such as the clearance level. They can be grouped in groups by linking them to users’ group nodes. This mean they will inherit the access rights of the group. Then, groups can be grouped themself iteratively. In the same way users can be grouped by roles’ node. Potentially these types of nodes can coexist, but this will lead to a more complex rights administration. As we mentioned before, we use relation between classes as a way to extends rights between instances. In fact, it is a common situation that some objects are semantically dependents to another object, that we call resource category. Our intuition is to use this kind of relation between classes to automatically insert specific ”extends_rights” edges between their instances. Finally, we can specify access rights between users or user’s groups and resource’s classes with authorization edges. In our model we want to allow both open and closed policies, so these edges can have opposite meanings. With closed policies access’ rights are usually assigned with the ”need to know” principle, which is that a user has to be given access to the resources strictly necessary to carry out his duties, and everything else is forbidden. On the contrary, using open policies are specified resources where a user has not access and all that is not specified is allowed. In addition, our model provides for particular policy specification using exception edges, that are authorization edges between user and resource in contrast to those between users’ groups and resources’ categories. Imposing this constrain we ensure to allow expressiveness without leading to too much complexity. Figure 2: example of KG integrated with the users’ Graph and the authorization edges Determining which node a user can access with a traversal has many advantage: • The graph model has great readability and also inheritance mechanism are easy to understand • You can specify a great variety of policies with a few edges and attributes • Full-scan are not needed as you can just use edges to reach resources you are or not allowed to access • Maintainability is also simplified. For example, if a user wants to know which users have access to an object, she can exploit the reverse path used during the resource’s access mentioned before As an example of the above mentioned concepts let consider figure 2 whose Domain Graph is taken from DBpedia. The user’s graph is linked to the Domain Graph with specific edges, and it is also integrated by AC edges itself. Let’s clarify their meanings: • Green edges represent the positive ”can_view” permission. • Red edges represent the negative ”cannot_view” permission. • Blue edges represent the extend rights relation. This last relation can be defined between specific classes of the Domain Graph Ontology. For example, we specify that the relation ”mentioned_in” extends the access rights from entity of type ”Person” to entity of type ”Work”. Then we can use automatic procedure to add the access control attribute ”extends_rights” to all the edges ”mentioned_in” between entities of the classes defined before. The graph shown in figure 2 allow to describe the follow access rules: • The department ”Africa NSA” is allowed to view all information about Boko Haram organisation, but: – The object ”Intercept 003765” cannot be seen from agent Paul and agent Patricia due to too low clearence level – The object ”Satellite image 05/02” cannot be seen from agent Paul due to a ”cannot” edge. • The department ”Middle East NSA” is allowed to view all Daesh organisation and Boko Hara is and affiliated organization. • Agent Rick and agent Patricia are Director of their department, and this allows them to have full access to both organisation Moeover, let assume that the agency found a connection between Daesh and the drug dealer Ben Ziane Berhily, who supplies daesh amphetamine. Agent Linda is now allowed to view Ben Ziane data with an exception specified with the ”can” edge. 5. Implementation For the implementation of our access control model, we rely on Tinkerpop 5 , which is an Apache project that provides several services and is compatible with fairly all the most common graph products. The most useful feature of this framework for our purposes is the SubGraphStrategy, which is a traversal strategy that allows to create a virtual subgraph based on given constraints. What this method really does is to provide an automatic mechanism that, given a user query, 5 tinkerpop.apache.org/ verify at every step if the initial constraints are satisfied. Since we use a pattern-based approach integrated with exceptions, we cannot use constraints based on attributes. Our solution is to split the process into a two step. First, we retrieve all the IDs of the resource a user can access, then we use this list as a constraint for the traversal strategy. This prevents us to check for many different conditions at every step using a more immediate check like equality between IDs. 6. Conclusions and Future Work We find out that in the state of the art there are not efficient, flexible and general-purpose access control models for Knowledge graphs. We propose an approach based on graph traversal over specific patterns and on the creation of a subgraph using a Tinkerpop feature to address this issue. This is a preliminary model on the top of which it has to be build a more complete access control policy that includes write, delete and own rights. We left as a future work also an extensive test of the scalability of the model. References [1] A. Piplai, et al., Creating cybersecurity knowledge graphs from malware after action reports (2019). [2] F. Liao, L. Ma, D. Yang, Research on construction method of knowledge graph of us military equipment based on bilstm model, in: International Conference on High Performance Big Data and Intelligent Systems, IEEE, 2019, pp. 146–150. [3] P. Szekely, et al., Building and using a knowledge graph to combat human trafficking, in: International Semantic Web Conference, Springer, 2015, pp. 205–221. [4] M. Rotmensch, Y. Halpern, A. Tlimat, S. Horng, D. Sontag, Learning a health knowledge graph from electronic medical records, Scientific reports 7 (2017) 1–11. [5] L. Shi, et al., Semantic health knowledge graph: Semantic integration of heterogeneous medical knowledge and services, BioMed research international 2017 (2017). [6] L. Kagal, T. Finin, A. Joshi, A policy based approach to security for the semantic web, in: International semantic web conference, Springer, 2003, pp. 402–418. [7] F. Abel, other, Enabling advanced and context-dependent access control in rdf stores, in: The Semantic Web, Springer, 2007, pp. 1–14. [8] S. Kirrane, Linked data with access control, in: Workshop on. pp, volume 14, 2015, p. 23. [9] H. Thakkar, D. Punjani, J. Lehmann, S. Auer, Two for one: Querying property graph databases using sparql via gremlinator, in: International Workshop on Graph Data Man- agement Experiences & Systems (GRADES) and Network Data Analytics (NDA), 2018, pp. 1–5. [10] E. Damiani, S. De Capitani di Vimercati, S. Paraboschi, P. Samarati, A fine-grained access control system for xml documents, ACM Transactions on Information and System Security (TISSEC) 5 (2002) 169–202. [11] T. Berners-Lee, J. Hendler, O. Lassila, et al., The semantic web, Scientific american 284 (2001) 28–37. [12] R. Stojanov, S. Gramatikov, I. Mishkovski, D. Trajanov, Linked data authorization platform, IEEE Access 6 (2017) 1189–1213. [13] J. Hollenbach, J. Presbrey, T. Berners-Lee, Using rdf metadata to enable access control on the social semantic web, in: Workshop on Collaborative Construction, Management and Linking of Structured Knowledge, volume 514, 2009, p. 167. [14] M. Goncalves, M.-E. Vidal, K. M. Endris, Pure: A privacy aware rule-based framework over knowledge graphs, in: International Conference on Database and Expert Systems Applications, Springer, 2019, pp. 205–214. [15] A. Jain, C. Farkas, Secure resource description framework: an access control model, in: Symposium on Access control models and technologies, ACM, 2006, pp. 121–129. [16] S. Kirrane, A. Mileo, S. Decker, Access control and the resource description framework: A survey, Semantic Web 8 (2017) 311–352. [17] E. Sahafizadeh, M. A. Nematbakhsh, A survey on security issues in big data and nosql, Advances in Computer Science: an International Journal 4 (2015) 68–72. [18] A. A. Alotaibi, R. M. Alotaibi, N. Hamza, Access control models in nosql databases: An overview (2019). [19] C. Morgado, G. B. Baioco, T. Basso, R. Moraes, A security model for access control in graph-oriented databases, in: International Conference on Software Quality, Reliability and Security, IEEE, 2018, pp. 135–142. [20] R. S. Sandhu, P. Samarati, Access control: principle and practice, IEEE communications magazine 32 (1994) 40–48.