<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Discovery of Ontologies from Implicit User Knowledge</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Chair of Computer Science 6 (Data Management) University of Erlangen https://</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The purpose of the Semantic Web is to enable worldwide access to humanity's knowledge in a machine-processable way. A major obstacle to this has been that knowledge is often either represented in an incoherent way, or not externalized at all and only present in people's minds. Populating a knowledge graph and manually building an ontology by a domain expert is tedious work, requiring great initial effort until the result can be used. As a consequence, knowledge will often never be made available to the Semantic Web. The aim of this project is to develop a new approach for building ontologies from implicit user knowledge that is already present, but hidden in various artifacts like SQL query logs or application usage patterns.</p>
      </abstract>
      <kwd-group>
        <kwd>Semantic Web</kwd>
        <kwd>Knowledge Graph</kwd>
        <kwd>Schema Inference</kwd>
        <kwd>Query-Driven</kwd>
        <kwd>Data Integration</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In the last decades, the World Wide Web indisputably changed human society
and economy. Computers, paradoxically, although essentially operating the Web,
cannot make use of it on their own. Knowledge on the Web is represented mostly
in a way suitable for humans, as pages containing plain text and graphics.
Although web pages can be structured hierarchically and linked with each other,
their inherent semantics are only accessible by a human being perceiving the
content. Querying the Web is usually restricted to simple keyword-based search
engines or web services with proprietary APIs. Apart from these technical
obstacles, the Web also does not define coherent sets of terms to be used
to describe concepts and entities of a particular domain, leaving that task to
human interpretation.</p>
      <p>
        The Semantic Web [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] offers a framework for machines to make use of this
knowledge. Instead of storing linked HTML documents, the Semantic Web links
facts with each other. This is done using the Resource Description Framework
(RDF), which operates on a graph-based data model. The graph can then be
queried using SPARQL, the default RDF query language, whose expressivity is
similar to that of SQL for relational databases.
      </p>
      <p>To interpret RDF data, an ontology must be defined.
This can be done either in RDF Schema or in the Web Ontology Language
(OWL). In a nutshell, an ontology is a set of axioms that constrain which
statements can or cannot be true and allow new statements to be deduced from
existing ones. Creating these ontologies manually is tedious work and therefore an
obstacle to Semantic Web adoption.</p>
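      <p>To make the idea of deducing new statements concrete, the following minimal sketch stores facts as subject-predicate-object triples and applies one RDFS-style rule (an instance of a class is also an instance of its superclasses). It uses plain Python sets rather than an RDF library, and all ex: identifiers are invented for illustration.</p>
      <preformat><![CDATA[
```python
# Facts are (subject, predicate, object) triples, mirroring the RDF data model.
# All ex: identifiers are made up for this illustration.
facts = {
    ("ex:alice", "rdf:type", "ex:DataScientist"),
    ("ex:DataScientist", "rdfs:subClassOf", "ex:Person"),
    ("ex:Person", "rdfs:subClassOf", "ex:Agent"),
}


def infer_types(triples):
    """Apply 'x type C, C subClassOf D => x type D' until nothing new appears."""
    known = set(triples)
    while True:
        new = {
            (s, "rdf:type", d)
            for (s, p, c) in known if p == "rdf:type"
            for (c2, p2, d) in known if p2 == "rdfs:subClassOf" and c2 == c
        } - known
        if not new:
            return known
        known |= new


closure = infer_types(facts)
```
]]></preformat>
      <p>Here, closure additionally contains the statements that ex:alice is an ex:Person and an ex:Agent, neither of which was stated explicitly.</p>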
      <p>Knowledge already exists somewhere, either in people’s minds or in various
kinds of artifacts: semi-structured file formats like CSV or JSON, plain text in
natural language, application source code, log files, or SQL queries.
Transferring all this knowledge by hand into a graph is time-consuming and expensive,
so it is feasible only for limited use cases. Developing an at least
partly automated method to perform this task could drastically lower the costs
of deploying Semantic Web techniques.</p>
    </sec>
    <sec id="sec-2">
      <title>Past Research</title>
      <p>
        This PhD research project will extend the scope of the previous master's thesis
project Pharos, whose results have been published in a follow-up research paper
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The focus of Pharos was to improve the understanding of heterogeneous
data sources within a data lake by analyzing SQL query logs accessing these
sources and by extracting knowledge fragments from those queries in order to
gain insights about the underlying schema. This may seem unintuitive at first
glance, as SQL is usually associated with relational databases, where schemata
are already known, but SQL has evolved into a general query language for
heterogeneous data sources. When data scientists encounter an unknown data
source, they need a great deal of cognitive effort to understand its semantics prior
to writing the queries that use the data source for analytics. Therefore, each
query implicitly contains hidden knowledge and assumptions about the data and
can be seen as a partial schema definition.
      </p>
      <p>For example, joining two tables over an attribute indicates that the data
analyst probably identified a foreign-key relationship; otherwise, they would not
have made that join. Renaming columns with descriptive names or adding explicit type
casts gives hints about their meaning.</p>
      <preformat>select sum(p.salary), dep.id, dep.name
from person p join department dep
  on p.dep_id = dep.id
where dep.location = 'DE' or dep.location = 'FR'
group by dep.id, dep.name
order by dep.name;</preformat>
      <p>Listing 1.1. Example query with partial schema information</p>
      <p>[Figure: knowledge graph derived from the query in Listing 1.1. Nodes: person, department, salary, dep_id, id, name, location, string, 'DE', 'FR'. Edge labels: has_attribute, joined_with, equals, is_type.]</p>
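      <p>How such fragments could be pulled out of the example query is sketched below. This is not the Pharos implementation: it relies on a few regular expressions instead of a real SQL parser, assumes every table has an alias, and the function name extract_fragments is hypothetical. The edge labels follow those of the graph above.</p>
      <preformat><![CDATA[
```python
import re

QUERY = """
select sum(p.salary), dep.id, dep.name
from person p join department dep
  on p.dep_id = dep.id
where dep.location = 'DE' or dep.location = 'FR'
group by dep.id, dep.name
order by dep.name;
"""


def extract_fragments(sql):
    """Return (subject, predicate, object) fragments found by pattern matching."""
    # Map aliases to table names, e.g. "from person p join department dep".
    aliases = {
        alias: table
        for table, alias in re.findall(r"(?:from|join)\s+(\w+)\s+(\w+)", sql, re.I)
    }
    fragments = set()
    # Every alias.column reference tells us that the table has that attribute.
    for alias, column in re.findall(r"\b(\w+)\.(\w+)\b", sql):
        if alias in aliases:
            fragments.add((aliases[alias], "has_attribute", column))
    # An equality between two qualified columns hints at a join relationship.
    for a1, c1, a2, c2 in re.findall(r"(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)", sql):
        if a1 in aliases and a2 in aliases:
            left = f"{aliases[a1]}.{c1}"
            right = f"{aliases[a2]}.{c2}"
            fragments.add((left, "joined_with", right))
    # A comparison with a quoted literal hints at the value domain and the type.
    for alias, column, literal in re.findall(r"(\w+)\.(\w+)\s*=\s*'([^']*)'", sql):
        if alias in aliases:
            attribute = f"{aliases[alias]}.{column}"
            fragments.add((attribute, "equals", literal))
            fragments.add((attribute, "is_type", "string"))
    return fragments


fragments = extract_fragments(QUERY)
```
]]></preformat>
      <p>For the example query, the result contains the join between person.dep_id and department.id, the attributes of both tables, and the two literals compared with department.location.</p>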
      <p>We decided to build a knowledge graph from SQL query logs that could
help in understanding the mental model behind data sources. A
prototype was written that demonstrates the feasibility of the approach. It was
implemented in the form of a JDBC proxy driver that captures and analyzes all
SQL queries a Java application sends to a SQL query engine like Apache Drill.
Because it is compatible with any software using JDBC drivers, the prototype can
be deployed into existing workflows in a minimally invasive way. The prototype was
evaluated using a test database with a known schema and a set of test queries
based on exercises of our introductory database lecture.</p>
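      <p>The interception idea behind such a proxy can be illustrated independently of JDBC. The sketch below is only an analogy, not the prototype: it wraps a Python DB-API connection (sqlite3 as a stand-in for the query engine) and records each statement before forwarding it. The class name LoggingConnection and the table contents are hypothetical.</p>
      <preformat><![CDATA[
```python
import sqlite3


class LoggingConnection:
    """Wraps a DB-API connection and records every statement before forwarding it."""

    def __init__(self, inner, log):
        self._inner = inner
        self._log = log

    def execute(self, sql, params=()):
        self._log.append(sql)  # capture the query text for later analysis
        return self._inner.execute(sql, params)  # then behave like the real connection

    def __getattr__(self, name):
        # Everything that is not intercepted is delegated unchanged.
        return getattr(self._inner, name)


query_log = []
conn = LoggingConnection(sqlite3.connect(":memory:"), query_log)
conn.execute("create table department (id integer, name text)")
conn.execute("insert into department values (?, ?)", (1, "Research"))
rows = conn.execute("select name from department where id = ?", (1,)).fetchall()
conn.close()
```
]]></preformat>
      <p>The application code keeps calling execute as before, while query_log accumulates the statements that a later analysis step can mine for schema fragments.</p>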
    </sec>
    <sec id="sec-3">
      <title>Research Objectives</title>
      <p>The resulting knowledge graph describes how data sources have been used by
analysts, but does not describe the semantics of the data sources themselves,
such as value constraints or foreign key relationships. Human interpretation of the
results is still required to gain insights about the data sources used. Therefore, the
next step will be to perform automatic reasoning on top of the knowledge graph.
This requires generating an (incomplete) ontology that describes the semantics
of the data sources. Because knowledge derived from SQL queries may be contradictory
when the query log contains queries that do not conform to the underlying
schema, an approximate approach is needed to deal with this ambiguity.</p>
      <p>Queries do not reflect the semantics of a data source, but the mental model
a data scientist has made of it. Analyzing these mental models can already
give valuable insights. For example, if someone uses a “grade” attribute and
compares it with values that are not present in the dataset, the explanation
could be that the user originated from a country with a different school grading
system. There are multiple concepts of what a “grade” should be;
a query-driven approach could make these concepts more transparent
as an intermediate step.</p>
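      <p>The “grade” case can be phrased as a simple check: which literals used in query predicates never occur in the column they are compared with? The following sketch assumes that the literals have already been extracted from the query log; the function name and all data values are invented for illustration.</p>
      <preformat><![CDATA[
```python
def unseen_literals(query_literals, column_values):
    """Return literals used in query predicates that never occur in the column."""
    present = set(column_values)
    return sorted(lit for lit in set(query_literals) if lit not in present)


# Hypothetical data: the column stores numeric grades,
# while some logged queries compare it against letter grades.
grade_column = ["1.0", "1.3", "2.0", "2.7", "4.0"]
literals_in_queries = ["A", "B", "2.0", "A"]

mismatch = unseen_literals(literals_in_queries, grade_column)
```
]]></preformat>
      <p>A non-empty result such as the letter grades here does not prove an error, but it flags a place where the user's mental model and the data may diverge.</p>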
      <p>When analyzing SQL query logs, there will always be queries that are based
on wrong assumptions about the schema, especially if the query log
originates from an interactive session where queries with undesired results may be
rewritten. With multiple query logs from different sessions and users, finding the
similarities in their behavior and their mental model could lead to the intended
semantics of the used data sources.</p>
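      <p>One simple way to look for such similarities is to count in how many sessions each extracted fragment occurs and to keep only those with sufficient support. The sketch below illustrates this under stated assumptions: the threshold of 0.5, the function name consensus, and the sample fragments are not taken from the project.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def consensus(sessions, min_support=0.5):
    """Keep fragments asserted by at least min_support of the sessions."""
    counts = Counter()
    for fragments in sessions:
        counts.update(set(fragments))  # count each fragment once per session
    threshold = min_support * len(sessions)
    return {frag: n / len(sessions) for frag, n in counts.items() if n >= threshold}


fk = ("person.dep_id", "joined_with", "department.id")
wrong = ("person.id", "joined_with", "department.id")  # a mistaken join in one session
sessions = [{fk}, {fk, wrong}, {fk}, {fk}]

agreed = consensus(sessions)
```
]]></preformat>
      <p>In this example only the join used in all four sessions survives, while the join that appears in a single session is dropped as a likely wrong assumption.</p>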
      <p>A self-learning system shall be developed that makes suggestions to data
scientists about suitable sources or queries they may find helpful for their task.
Based on their given feedback and performed queries, the system shall
incrementally approximate the true semantics behind the data sources.</p>
      <p>Thus far, only SQL query logs were considered as a source for query-driven
schema inference. But there are other types of queries to consider, like query
strings from search engines, application usage patterns extracted from graphical
analysis tools or even source code from programs accessing a data source. Other
query languages like XQuery or languages from various NoSQL database systems
could be included. The approach does not depend on a specific language.</p>
    </sec>
    <sec id="sec-4">
      <title>Related Work</title>
      <p>
        Many approaches for schema inference are data-driven, using data profiling
methods to reconstruct the underlying schema of a given dataset. A significant
example is the Metanome project [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], which provides an extensible framework
offering various algorithms, for example to discover functional dependencies [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
Datatype-based schema inference for JSON datasets is demonstrated in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] it is shown how to identify the domains the values of a column come
from. The Datamaran project [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] aims to discover structure in text files like
application logs and to transform them into normalized relational tables. The
ESKAPE platform [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] allows users to assign instances to semantic models [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
A general overview of dataset search and integration techniques is given in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>Evaluation Approach</title>
      <p>The existing prototype will be extended to use Semantic Web reasoning
techniques to deduce the meaning of a data source from the knowledge extracted from
query fragments. A framework of rules should be defined to achieve this, possibly
using the Rule Interchange Format (RIF). The prototype shall then be tested on
real-world query logs, so that the resulting knowledge graph can be compared with
the actual schema the data sources are based on. A supplementary user study
will show whether the software helps data scientists
understand heterogeneous data sources.</p>
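      <p>The comparison with the actual schema could, for instance, be expressed as precision and recall over inferred relationships. The sketch below uses a plain Python function as a stand-in for a rule (it is not RIF), and the rule, the fragments, and the reference schema are hypothetical examples.</p>
      <preformat><![CDATA[
```python
def foreign_key_candidates(fragments):
    """Rule: a joined_with edge between two attributes suggests a foreign key."""
    return {(s, o) for (s, p, o) in fragments if p == "joined_with"}


def precision_recall(inferred, actual):
    """Compare inferred relationships with those of the known schema."""
    hits = len(inferred & actual)
    precision = hits / len(inferred) if inferred else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall


fragments = {
    ("person.dep_id", "joined_with", "department.id"),
    ("person.id", "joined_with", "department.id"),  # query based on a wrong assumption
    ("department", "has_attribute", "name"),
}
actual_schema = {
    ("person.dep_id", "department.id"),
    ("department.manager_id", "person.id"),  # never exercised by any logged query
}

precision, recall = precision_recall(foreign_key_candidates(fragments), actual_schema)
```
]]></preformat>
      <p>Precision drops when queries encode wrong assumptions, and recall drops when parts of the schema are never touched by the logged queries, so both measures are worth reporting.</p>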
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <source>A Semantic Web Primer</source>
          . Cooperative Information Systems, MIT Press, Cambridge, Mass., 3rd edn. (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Baazizi</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ben Lahmar</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Colazzo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghelli</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sartiani</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Schema inference for massive JSON datasets</article-title>
          .
          <source>In: Proceedings of the 20th International Conference on Extending Database Technology</source>
          . pp.
          <fpage>222</fpage>
          -
          <lpage>233</lpage>
          . Venice, Italy
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Baazizi</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Colazzo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghelli</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sartiani</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Parametric schema inference for massive JSON datasets</article-title>
          .
          <source>The VLDB Journal</source>
          <volume>28</volume>
          (
          <issue>4</issue>
          ),
          <fpage>497</fpage>
          -
          <lpage>521</lpage>
          (
          <year>Aug 2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Chapman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simperl</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koesten</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Konstantinidis</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ibáñez</surname>
            ,
            <given-names>L.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kacprzak</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Groth</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Dataset search: A survey</article-title>
          .
          <source>The VLDB Journal</source>
          <volume>29</volume>
          (
          <issue>1</issue>
          ),
          <fpage>251</fpage>
          -
          <lpage>272</lpage>
          (
          <year>Jan 2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parameswaran</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Navigating the Data Lake with DATAMARAN: Automatically Extracting Structure from Log Datasets</article-title>
          .
          <source>In: Proceedings of the 2018 International Conference on Management of Data - SIGMOD '18</source>
          . pp.
          <fpage>943</fpage>
          -
          <lpage>958</lpage>
          . ACM Press, Houston, TX, USA (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Haller</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lenz</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Pharos: Query-Driven Schema Inference for the Semantic Web</article-title>
          .
          <source>In: Machine Learning and Knowledge Discovery in Databases</source>
          , vol.
          <volume>1168</volume>
          , pp.
          <fpage>112</fpage>
          -
          <lpage>124</lpage>
          . Springer International Publishing, Cham
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naumann</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Holistic primary key and foreign key detection</article-title>
          .
          <source>J Intell Inf Syst</source>
          (
          <year>Jun 2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ota</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Müller</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freire</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Data-driven domain discovery for structured datasets</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .
          <volume>13</volume>
          (
          <issue>7</issue>
          ),
          <fpage>953</fpage>
          -
          <lpage>967</lpage>
          (
          <year>Mar 2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Papenbrock</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bergmann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finke</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zwiener</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naumann</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Data profiling with metanome</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .
          <volume>8</volume>
          (
          <issue>12</issue>
          ),
          <fpage>1860</fpage>
          -
          <lpage>1863</lpage>
          (
          <year>Aug 2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Pomp</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kraus</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poth</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meisen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Semantic Concept Recommendation for Continuously Evolving Knowledge Graphs</article-title>
          .
          <source>In: Enterprise Information Systems</source>
          , vol.
          <volume>378</volume>
          , pp.
          <fpage>361</fpage>
          -
          <lpage>385</lpage>
          . Springer International Publishing, Cham
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Pomp</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulus</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jeschke</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meisen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>ESKAPE: Information Platform for Enabling Semantic Data Processing</article-title>
          .
          <source>In: Proceedings of the 19th International Conference on Enterprise Information Systems</source>
          . pp.
          <fpage>644</fpage>
          -
          <lpage>655</lpage>
          . SCITEPRESS - Science and Technology Publications, Porto, Portugal (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>