
Enterprise Data Classification Using Semantic Web Technologies

David Ben-David

davidbd@cs.technion.ac.il

Tamar Domany

Abigail Tarem

abigailtg@il.ibm.com

IBM Research - Haifa, University Campus, Haifa 31905, Israel

Organizations today collect and store large amounts of data in various formats and locations; however, they are sometimes required to locate all instances of a certain type of data. Data classification enables efficient retrieval of information when needed. This work presents a reference implementation for enterprise data classification using Semantic Web technologies. We demonstrate automatic discovery and classification of Personally Identifiable Information (PII) in relational databases, using a classification model in RDF/OWL that describes the elements to discover and classify. At the end of the process the results are also stored in RDF, enabling simple navigation between the input model and the findings in different databases. Recorded demo link: https://www.research.ibm.com/haifa/info/demos/piidiscovery_full.htm

Semantic Techniques, RDF, Classification modeling, NeOn, RelationalOWL

Organizations today collect and store large amounts of data in various formats and locations. When an organization is required to meet certain legal or regulatory requirements, for instance to comply with regulations or perform discovery during civil litigation, it needs to find all the places where the required data is located. Data discovery and classification is about finding and marking enterprise data in a way that enables quick and efficient retrieval of the relevant information when needed. Most existing approaches either require re-classification of the data each time the organization's policies change, can only be applied to a single data type or format, or only identify predefined sets of known fields.

In this work we demonstrate the concept of enterprise data classification using Semantic Web technologies described in [2]. The goal of the solution is to provide organizations with a tool that automatically locates and annotates valuable information, provides manageable results, and enables quick and easy access to the data when needed. For example, if an organization is required to mask all Social Security Numbers (SSN) in order to comply with a privacy regulation, all occurrences of SSN must first be found.

This reference implementation demonstrates the automatic discovery and classification of Personally Identifiable Information (PII) stored in relational databases. The classification process starts with creating a model described using the Resource Description Framework (RDF), containing the entities to discover and classify as well as additional information that can help the discovery process (e.g., type and format); this is referred to as a classification model. In this demo we used a model representing PII, but any model that follows the meta-model described in [2] can be used. The result of the classification process is a set of RDF triples linking entities in the classification model to locations in the data stores, in this case database tables and columns. Using RDF to define the classification model makes it easy to expand, merge, and combine existing models and to generate new models for different purposes. Because the classification results are also represented in RDF that follows the same schema as the classification model, we can unify the results from different classifiers, navigate easily between the model entities and the data sources (thanks to the use of URIs), annotate, reason over, and query the classified data, and more. The classification process is composed of four stages:

1. Creating or loading an existing classification model.

2. Importing database schemas.

3. Discovering and classifying the data according to the classification model, using SPARQL and various classification algorithms.

4. Representing the results in a way that allows navigation between the classification model and the specific columns where the information was found.
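To make the shared representation of model and results more concrete, the following minimal sketch models RDF triples as plain Python tuples instead of using an RDF library. The namespaces, the `hasType` and `locatedIn` predicates, and the `objects` helper are all hypothetical and are not taken from the meta-model in [2]; the sketch only illustrates that, when model and results share one triple shape, results from different classifiers can be unified by a set union and queried by pattern matching.

```python
# Minimal sketch: RDF-like triples as (subject, predicate, object) tuples.
# All URIs and predicate names below are illustrative assumptions.
MODEL = "http://example.org/pii-model#"
DB = "http://example.org/db#"

# A tiny classification model: entities to discover plus a type hint.
model = {
    (MODEL + "SSN", MODEL + "hasType", "string"),
    (MODEL + "FirstName", MODEL + "hasType", "string"),
}

# Results produced by two hypothetical classifiers.
results_a = {(MODEL + "SSN", MODEL + "locatedIn", DB + "EMPLOYEE.SSN")}
results_b = {
    (MODEL + "SSN", MODEL + "locatedIn", DB + "EMPLOYEE.SSN"),
    (MODEL + "FirstName", MODEL + "locatedIn", DB + "PATIENT.FIRST_NAME"),
}

# The same triple shape everywhere means unification is a set union;
# the duplicate SSN finding collapses into a single triple.
graph = model | results_a | results_b


def objects(graph, subject, predicate):
    """Return the sorted objects of all triples matching subject and predicate."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


ssn_columns = objects(graph, MODEL + "SSN", MODEL + "locatedIn")
```

In a real deployment the same lookup would presumably be expressed as a SPARQL query over an RDF store; the tuple set here is only a stand-in.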

A high-level view of this process is depicted in Figure 1.

Implementation

Our reference implementation is based on the NeOn toolkit [4], an open-source, Eclipse-based ontology engineering environment. In addition, we used the Eclipse Data Tools Platform (DTP) [1] to define connections with local and remote databases. We chose RelationalOWL [5] as the basis for creating an RDF representation of the database metadata. We also used the Jena framework [3] to access and query the different RDF representations. To those we added a set of "home-made" plug-ins that perform the discovery and integrate the different components in the system. The discovery component uses the syllabifying techniques described in [2], as well as type checking. Future extensions are planned to include additional linguistic techniques (such as stemming) and the use of sample content to verify the data's format.
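The discovery step can be pictured roughly as follows. This sketch is not the algorithm from [2]: it substitutes simple token splitting of column names for the syllabifying technique, and the keyword table, the column metadata, and all function names are invented for illustration. It combines a name match with a type check, which is the only part taken from the description above.

```python
import re

# Hypothetical model: entity -> (keyword phrases, acceptable SQL types).
MODEL = {
    "FirstName": ({("first", "name"), ("fname",)}, {"CHAR", "VARCHAR"}),
    "SSN": ({("ssn",), ("social", "security")}, {"CHAR", "VARCHAR", "INTEGER"}),
}


def tokenize(column_name):
    """Split a name like 'FIRST_NAME' or 'firstName' into lowercase tokens."""
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", column_name)
    return tuple(t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t)


def contains(tokens, phrase):
    """True if phrase occurs as a contiguous run inside tokens."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def classify(columns):
    """Yield (entity, table, column) when both the name and the type match."""
    for table, column, sql_type in columns:
        tokens = tokenize(column)
        for entity, (phrases, types) in MODEL.items():
            if sql_type in types and any(contains(tokens, p) for p in phrases):
                yield entity, table, column


# Illustrative column metadata: (table, column, SQL type).
columns = [
    ("EMPLOYEE", "FIRSTNME", "VARCHAR"),        # abbreviated: token match misses it
    ("PATIENT", "FIRST_NAME", "VARCHAR"),
    ("PATIENT", "SSN", "VARCHAR"),
    ("PATIENT", "firstNameLength", "INTEGER"),  # name matches, type check rejects
]
found = list(classify(columns))
```

The abbreviated `FIRSTNME` column is deliberately missed here; handling such names is the kind of case where a more capable name-analysis technique than plain token splitting would be needed.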

Using our classification tool, users can create projects, build and edit models, import existing models (in both cases, the models are validated against the meta-model), and import or create RDF representations of database metadata. Users can perform the discovery process on any combination of models and databases. The results, as well as the models and database metadata, can be viewed in both a hierarchical view and a graph view, as depicted in Figure 2. Figure 2 shows part of the discovery results (in this case, all table columns in which a first name was discovered). For the purpose of this demonstration, we execute our classification on two externally available databases: one representing employee records in an organization (taken from the sample database created by the DB2® software installation) and the other representing medical records of patients (taken from "Avitek Medical Records Development Tutorials" by BEA Systems, Inc., available at http://download.oracle.com/docs/cd/E13222_01/wls/docs100/medrec_tutorials/index.html).

As noted previously, we use RDF to represent the discovery results, making it possible to navigate from any node in the result back both to the classification field in the model and to the data field (column) in the database representation. This easy navigation allows verifying the classification results, refining them (adding or removing triples), and enriching the model so that it is more accurate in subsequent runs.
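As a small illustration of this navigation and refinement, the sketch below treats each finding as its own node linked to a model field and to a database column. The node identifiers, the `classifiedAs` and `foundIn` predicates, and both helper functions are assumptions made for the example, not names used by the tool.

```python
# Each finding is a node linked to a model field and to a database column.
# Predicate names and identifiers are illustrative assumptions.
results = {
    ("finding1", "classifiedAs", "model:FirstName"),
    ("finding1", "foundIn", "db:PATIENT.FIRST_NAME"),
    ("finding2", "classifiedAs", "model:FirstName"),
    ("finding2", "foundIn", "db:PRODUCT.FIRST_NAME_OF_BRAND"),
}


def follow(graph, subject, predicate):
    """Return one object linked from subject via predicate, or None."""
    for s, p, o in graph:
        if s == subject and p == predicate:
            return o
    return None


def remove_finding(graph, finding):
    """Return a new graph without any triple whose subject is the finding."""
    return {t for t in graph if t[0] != finding}


# Navigate from a result node back to the model field and to the column.
field = follow(results, "finding2", "classifiedAs")
column = follow(results, "finding2", "foundIn")

# A reviewer judges finding2 to be a false positive and removes its triples.
refined = remove_finding(results, "finding2")
```

Because a refinement is just the removal or addition of triples, the corrected graph can be stored and queried in the same way as the original results.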

Summary

In this demonstration we exhibited the main advantages of our approach. By combining different discovery techniques and extracting most of the search logic to external files, we created a highly flexible and adaptable solution. Using RDF to represent both the ontologies and the results maximizes the modularity and extensibility of the classification input and facilitates easy navigation between the results, the models, and the data sources. The ontology can thus serve as a centralized point to manage all valuable information in the organization and enables easy location of all related pieces of data in one click. In addition, all of the information created and used by the system (models, metadata RDF representations, results) can be exposed to existing and evolving Semantic Web tools, such as semantic query languages, reasoning engines, and rule languages.

1. Eclipse Data Tools Platform (DTP) Project, http://www.eclipse.org/datatools/

2. Ben-David, D., Domany, T., Tarem, A.: Enterprise Data Classification using Semantic Web Technologies. In: ISWC (2010)

3. Carroll, J.J., Dickinson, I., Dollin, C., Seaborne, D.R.A., Wilkinson, K.: Jena: Implementing the Semantic Web Recommendations. Tech. rep., HP Laboratories (2003), http://www.hpl.hp.com/techreports/2003/HPL-2003-146.pdf

4. Holger, P.H., Studer, L.R., Tran, T.: The NeOn Ontology Engineering Toolkit. In: ISWC (2009), http://www.aifb.uni-karlsruhe.de/WBS/pha/publications/neon-toolkit.pdf

5. de Laborda, C.P., Conrad, S.: RelationalOWL: a data and schema representation format based on OWL. In: Conferences in Research and Practice in Information Technology. pp. 89–96 (2005)