<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enterprise Data Classification Using Semantic Web Technologies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>David Ben-David</string-name>
          <email>davidbd@cs.technion.ac.il</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tamar Domany</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abigail Tarem</string-name>
          <email>abigailtg@il.ibm.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Haifa, University Campus</institution>
          ,
          <addr-line>Haifa 31905</addr-line>
          ,
          <country country="IL">Israel</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IBM Research</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Organizations today collect and store large amounts of data in various formats and locations; however, they are sometimes required to locate all instances of a certain type of data. Data classification enables efficient retrieval of information when needed. This work presents a reference implementation for enterprise data classification using Semantic Web technologies. We demonstrate automatic discovery and classification of Personally Identifiable Information (PII) in relational databases, using a classification model in RDF/OWL describing the elements to discover and classify. At the end of the process the results are also stored in RDF, enabling simple navigation between the input model and the findings in different databases. Recorded demo link: https://www.research.ibm.com/haifa/info/demos/piidiscovery_full.htm</p>
      </abstract>
      <kwd-group>
        <kwd>Semantic Techniques</kwd>
        <kwd>RDF</kwd>
        <kwd>Classification</kwd>
        <kwd>modeling</kwd>
        <kwd>NeON</kwd>
        <kwd>RelationalOWL</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Organizations today collect and store large amounts of data in various formats
and locations. When an organization is required to meet certain legal or
regulatory requirements, for instance to comply with regulations or perform discovery
during civil litigation, it needs to find all the places where the required data is
located. Data discovery and classification is about finding and marking enterprise
data in a way that enables quick and efficient retrieval of the relevant
information when needed. Most existing approaches either require re-classification of the
data each time the organization's policies change, can only be applied to a single
data type or format, or only identify predefined sets of known fields.</p>
      <p>
        In this work we demonstrate the concept of enterprise data classification
using Semantic Web technologies described in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The goal of the solution is to
provide organizations with a tool that automatically locates and annotates
valuable information, provides manageable results and enables quick and easy access
to the data when needed. For example, if in order to comply with a privacy
regulation an organization is required to mask all Social Security Numbers (SSN),
all the occurrences of SSN must be found.
      </p>
      <p>
        This reference implementation demonstrates the automatic discovery and
classification of Personally Identifiable Information (PII) stored in relational
databases. The classification process starts with creating a model described using
the Resource Description Framework (RDF), containing the entities to discover
and classify as well as additional information that can help the discovery process
(e.g., type and format); this is referred to as a classification model. In this demo
we used a model representing PII, but any model that follows the meta-model
described in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] can be used. The result of the classification process is a set of
RDF triples linking entities in the classification model to locations
in the data stores, in this case database tables and columns. Using RDF to
define the classification model makes it easy to expand, merge and combine
existing models and generate new models for different purposes. Because the
classification results are also represented in RDF that follows the same schema as
the classification model, we can unify the results from different classifiers,
navigate easily between the model entities and the data sources (thanks to the
use of URIs), and annotate, reason over, and query the classified data. The
classification process is composed of four stages:
      </p>
    </sec>
    <sec id="sec-2">
      <p>
        <list list-type="order">
          <list-item><p>Creating or loading an existing classification model.</p></list-item>
          <list-item><p>Importing database schemas.</p></list-item>
          <list-item><p>Discovering and classifying the data according to the classification model,
using SPARQL and various classification algorithms.</p></list-item>
          <list-item><p>Representing the results in a way that allows navigation between the
classification model and the specific columns where the information was found.</p></list-item>
        </list>
      </p>
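      <p>The four stages can be sketched end to end in a few lines. The following is a minimal illustration only, not the reference implementation: it keeps triples as plain Python tuples instead of using an RDF store, replaces the SPARQL queries of stage 3 with set comprehensions, and uses an in-memory SQLite database in place of a real enterprise database. The namespaces, the predicates <monospace>keyword</monospace>, <monospace>name</monospace> and <monospace>foundIn</monospace>, and all function names are assumptions introduced for this example.</p>
      <preformat>
```python
import sqlite3

# Hypothetical namespaces; the actual URIs used by the tool are not given.
MODEL = "http://example.org/model#"
DB = "http://example.org/db#"


def load_model():
    """Stage 1: a tiny classification model as (subject, predicate, object) triples."""
    return {
        (MODEL + "SSN", "rdf:type", MODEL + "ClassificationField"),
        (MODEL + "SSN", MODEL + "keyword", "ssn"),
        (MODEL + "FirstName", "rdf:type", MODEL + "ClassificationField"),
        (MODEL + "FirstName", MODEL + "keyword", "firstname"),
    }


def import_schema(conn):
    """Stage 2: describe every table column of the database as triples."""
    triples = set()
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'"
    ).fetchall()
    for (table,) in tables:
        for row in conn.execute(f"PRAGMA table_info({table})"):
            column = row[1]  # row is (cid, name, type, notnull, dflt_value, pk)
            uri = f"{DB}{table}.{column}"
            triples.add((uri, "rdf:type", DB + "Column"))
            triples.add((uri, DB + "name", column))
    return triples


def discover(model, schema):
    """Stage 3: naive match of model keywords against normalized column names."""
    keywords = {(s, o) for (s, p, o) in model if p == MODEL + "keyword"}
    names = {(s, o) for (s, p, o) in schema if p == DB + "name"}
    return {
        (field, MODEL + "foundIn", column)
        for (field, keyword) in keywords
        for (column, name) in names
        if keyword == name.lower().replace("_", "")
    }


def locations(results, field):
    """Stage 4: navigate from a model field to the columns where it was found."""
    return sorted(
        o for (s, p, o) in results if s == field and p == MODEL + "foundIn"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (first_name TEXT, ssn TEXT, salary REAL)")
conn.execute("CREATE TABLE patient (FirstName TEXT, diagnosis TEXT)")
results = discover(load_model(), import_schema(conn))
print(locations(results, MODEL + "SSN"))
```
      </preformat>
      <p>Because the results are triples that share URIs with both the model and the schema description, stage 4 reduces to a lookup: asking for the locations of the SSN field returns the single matching column of the hypothetical <monospace>employee</monospace> table.</p>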
    </sec>
    <sec id="sec-3">
      <p>A high-level view of this process is depicted in Figure 1.</p>
      <sec id="sec-3-1">
        <title>Implementation</title>
        <p>
          Our reference implementation is based on the NeOn toolkit [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], an open-source,
Eclipse-based ontology engineering environment. In addition we used the Eclipse
Data Tools Platform (DTP) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] to define connections with local and remote
databases. We chose RelationalOWL [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] as the basis to create an RDF
representation of the database metadata. We also used the Jena framework [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] to
access and query the different RDF representations. To those we added a set
of "home-made" plug-ins that perform the discovery and integrate the
different components in the system. The discovery component uses the
syllabifying techniques described in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], as well as type checking. Future extensions are
planned to include additional linguistic techniques (such as stemming) and the
use of sample content to verify the data's format.
        </p>
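        <p>To make the combination of name-based matching and type checking concrete, the sketch below classifies a column from its name and declared SQL type. It is an assumption-laden illustration: it uses simple token splitting on underscores and camelCase boundaries as a stand-in for the syllabifying techniques, whose details are not reproduced here, and the model entries, token sets and allowed types are hypothetical.</p>
        <preformat>
```python
import re

# Hypothetical model entries: the name tokens that identify a field and the
# SQL base types a column must have to be accepted as that field.
MODEL_FIELDS = {
    "FirstName": {"tokens": {"first", "name"}, "types": {"CHAR", "VARCHAR"}},
    "SSN": {"tokens": {"ssn"}, "types": {"CHAR", "VARCHAR", "INTEGER"}},
}


def tokenize(column_name):
    """Split a column name on camelCase boundaries and non-letter characters."""
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", column_name)
    return {t.lower() for t in re.split(r"[^A-Za-z]+", spaced) if t}


def classify(column_name, sql_type):
    """Return the model fields whose tokens and type both match the column."""
    tokens = tokenize(column_name)
    # Drop any length or precision suffix, e.g. "VARCHAR(30)" becomes "VARCHAR".
    base_type = sql_type.split("(")[0].strip().upper()
    return sorted(
        field
        for field, spec in MODEL_FIELDS.items()
        if spec["tokens"].issubset(tokens) and base_type in spec["types"]
    )


print(classify("patientFirstName", "varchar(60)"))
print(classify("ssn", "DATE"))
```
        </preformat>
        <p>The type check acts as a filter on the name match: a column named <monospace>ssn</monospace> with a date type is rejected even though its name matches. Abbreviated column names would not be caught by plain token splitting, which is the kind of case a more elaborate linguistic technique is meant to address.</p>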
        <p>Using our classification tool, users can create projects, build and edit models,
import existing models (in both cases, the models are validated against the
meta-model) and import or create database metadata RDF representations. Users can
perform the discovery process on any combination of models and databases. The
results, as well as the models and database metadata, can be viewed in both a
hierarchical view and a graph view, as depicted in Figure 2. Figure 2 shows part
of the discovery results (in this case, all table columns in which a first name
was discovered).</p>
        <p>For the purpose of this demonstration, we execute our classification on two
externally available databases: one representing employee records in an
organization (taken from the sample database created by the DB2® software
installation) and the other representing medical records of patients (taken from "Avitek
Medical Records Development Tutorials" by BEA Systems, Inc.,
http://download.oracle.com/docs/cd/E13222_01/wls/docs100/medrec_tutorials/index.html).</p>
        <p>As noted previously, we use RDF to represent the discovery results, making
it possible to navigate from any node in the result back to both the classification
field in the model and the data field (column) in the database representation.
This easy navigation allows verifying the classification results, refining them
(adding or removing triples), and enriching the model so it is more accurate in
subsequent runs.</p>
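        <p>The navigation and refinement described above can be illustrated with a small set of result triples. This is a sketch under stated assumptions: the result nodes, the predicates <monospace>classifiedAs</monospace> and <monospace>locatedIn</monospace>, and the table and column names are all hypothetical, and plain tuples stand in for an RDF graph.</p>
        <preformat>
```python
# Hypothetical result triples: each result node points to a model field and
# to the database column in which that field was discovered.
results = {
    ("result:1", "classifiedAs", "model:FirstName"),
    ("result:1", "locatedIn", "db:employee.first_name"),
    ("result:2", "classifiedAs", "model:FirstName"),
    ("result:2", "locatedIn", "db:patient.first_name"),
    ("result:3", "classifiedAs", "model:SSN"),
    ("result:3", "locatedIn", "db:employee.phone"),  # a false positive
}


def columns_for(triples, field):
    """Navigate from a model field, through its result nodes, to the columns."""
    nodes = {s for (s, p, o) in triples if p == "classifiedAs" and o == field}
    return sorted(o for (s, p, o) in triples if p == "locatedIn" and s in nodes)


def field_of(triples, column):
    """Navigate in the other direction, from a column back to its model fields."""
    nodes = {s for (s, p, o) in triples if p == "locatedIn" and o == column}
    return sorted(o for (s, p, o) in triples if p == "classifiedAs" and s in nodes)


def remove_result(triples, node):
    """Refine the results by dropping every triple about one result node."""
    return {t for t in triples if t[0] != node}


print(columns_for(results, "model:FirstName"))
refined = remove_result(results, "result:3")
print(columns_for(refined, "model:SSN"))
```
        </preformat>
        <p>After a reviewer rejects the third result, the SSN field no longer points to any column, while the first-name findings are unaffected.</p>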
      </sec>
      <sec id="sec-3-2">
        <title>Summary</title>
        <p>In this demonstration we exhibited the main advantages of our approach. By
combining different discovery techniques and extracting most of the search logic
to external files, we created a highly flexible and adaptable solution. Using RDF
to represent both the ontologies and the results maximizes the modularity and
extensibility of the classification input and facilitates easy navigation between
the results, the models and the data sources. The ontology can thus serve as
a centralized point to manage all valuable information in the organization and
enables easy location of all related pieces of data in one click. In addition, all
of the information created and used by the system (models, metadata RDF
representations, results) can be exposed to existing and evolving Semantic Web
tools, such as semantic query languages, reasoning engines and rule languages.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <source>Eclipse Data Tools Platform (DTP) Project</source>
          . Data sheet, http://www.eclipse.org/datatools/
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Ben-David</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Domany</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tarem</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Enterprise Data Classification using Semantic Web Technologies</article-title>
          . In: ISWC (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Carroll</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dickinson</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dollin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seaborne</surname>
            ,
            <given-names>D.R.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilkinson</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Jena: Implementing the Semantic Web Recommendations</article-title>
          .
          <source>Tech. rep., HP Laboratories</source>
          (
          <year>2003</year>
          ), http://www.hpl.hp.com/techreports/2003/HPL-2003-146.pdf
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Holger</surname>
            ,
            <given-names>P.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Studer</surname>
            ,
            <given-names>L.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>The NeOn Ontology Engineering Toolkit</article-title>
          . In: ISWC (
          <year>2009</year>
          ), http://www.aifb.uni-karlsruhe.de/WBS/pha/publications/neon-toolkit.pdf
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>de Laborda</surname>
            ,
            <given-names>C.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Conrad</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>RelationalOWL: a data and schema representation format based on OWL</article-title>
          .
          <source>In: Conferences in Research and Practice in Information Technology</source>
          . pp.
          <fpage>89</fpage>
          –
          <lpage>96</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>