<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards rapidly developing database-supported machine learning applications</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Frank Rosner</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Hinneburg</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science, Martin-Luther-University Halle-Wittenberg</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Global Data and Analytics</institution>
          ,
          <addr-line>Allianz SE</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The development of a big data analytics application benefits from a conceptual model that jointly represents aspects of data management as well as machine learning. We demonstrate a recently proposed method to translate a Bayesian network into a usable entity relationship model, using the real-world example of the TopicExplorer system. TopicExplorer is an interactive web application for text mining that uses Bayesian topic models as a core component. Further, we sketch the vision of a conceptual framework that eases machine learning specific development tasks when building big data analytics applications.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The implementation of a big data analytics application requires joining data management software with machine learning tools. However, the fields of data management and machine learning have developed quite different models and notations. The former frequently uses entity-relationship models (ERM) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] while the latter uses probabilistic graphical models, in particular Bayesian networks (BN), to communicate key concepts during development. Even though both kinds of graphical notations show many details of the data, information that is explicit on one side remains implicit on the other and vice versa; there is no natural mutual understanding between the two worlds. However, a common conceptual description of the contributions from both worlds is crucial for the success of big data analytics development projects.
      </p>
      <p>
        Recently, we proposed a translation [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ] from a graphical BN model in plate notation into an entity relationship model. Such an ERM can easily be integrated into the overall ERM of the whole application. Thus, we gain the advantage of a formal conceptual view of the machine learning part that is integrated into the conceptual view of the data management side. Thereby, developers from the data management side understand the basic inputs and outputs of the machine learning part, which no longer remains a black box behind an abstract API in the data management model.
      </p>
      <p>We demonstrate the method on the real-world example of the TopicExplorer in Section 2. Based on this, we describe in Section 3 our vision of a conceptual framework that uses pre-translated BNs as a library of ERM snippets. Such a library could be used by data management developers to conceptually include machine learning methods in analytics applications. We believe that software development of big data analytics applications could benefit from machine learning implementations that are attached to the pre-translated BNs. Last, we discuss related work in Section 4 and conclude the paper in Section 5.</p>
      <p>[Figure 1: (a) The LDA plate model with plates Topics (K), Documents (N) and Tokens (M_n). (b) The atomic LDA plate model with the additional plate Words (V) and the indexed variables z_nmk and d_nmv.]</p>
    </sec>
    <sec id="sec-2">
      <title>Case Study: Text Topic Modeling</title>
      <p>
        We demonstrate the recent method [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ] to translate a BN into an ERM using the example of the TopicExplorer [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ], an application to explore document collections using probabilistic topic models [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We explain how the translated LDA topic model is represented as an ERM. Furthermore, we show use cases for typical analyses supported by the translated ERM.
      </p>
      <p>2.1 Translation of Latent Dirichlet Allocation to ERM</p>
      <p>
        LDA [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] models a collection of documents that is indexed by the set N. Each document n ∈ N consists of a set of tokens M_n that represents the words occurring in this document. The document-specific token index sets M_n partition the total index set of tokens M. A token m ∈ M_n corresponds to exactly one word type v from a vocabulary V. In the Bayesian network, Figure 1a, this information is coded as a bit vector d_nm ∈ {0, 1}^|V| that has exactly a single 1 at the index associated with the respective word v ∈ V. Each token m ∈ M_n is also assigned to a topic k ∈ K. This assignment is coded by the bit vector z_nm ∈ {0, 1}^|K|, which has a single 1 at the respective topic index. Furthermore, each topic has its own word distribution parameterized by a vector of positive real numbers φ_k ∈ R^|V|. The topic proportions per document are represented by a similar vector θ_n ∈ R^|K|. The hyper-parameter vectors α ∈ R^|K| and β_k ∈ R^|V| regulate the prior distributions for the respective hidden parameters θ_n and φ_k.
      </p>
      <p>
        The translation [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ] delivers the ERM shown in the right part of Figure 2 for the given LDA plate model (Figure 1a). The translation employs several intermediate steps, one of which is the transformation of the plate model into an atomic plate model (APM), see Figure 1b. The APM represents relational information that is implicit in the original plate model in an explicit way, still using the plate notation. By the transformation and reduction rules [
        <xref ref-type="bibr" rid="ref13 ref14 ref8">13, 14, 8</xref>
        ], the APM is translated into a sequence of several intermediary ERMs and then reduced to a usable final ERM.
      </p>
      <p>[Figure 2: Left: ERM of the domain data with entity types Document (DocID, URL, Date, Content), Token (TokenID, Position, POS) and Word (WordID, Name). Right: translated ERM of LDA with entity types Document, Token, Word and Topic (TopicID) and relationship types D-T and T-W; the relationships carry cardinalities (1,1) and (1,*).]</p>
      <p>Such ERM is close to a manually designed ERM for LDA. A document
consists of one or more tokens which are of exactly one word type. Each token
is assigned to a topic, while one topic can have multiple tokens assigned. The
inferred topic mixture for each document is stored in D-T. , while T-W. holds
the word probabilities for each topic. The hyper-parameter of the prior for the
topic mixture resides as an attribute of the topic entity type. The parameters
for the individual priors for the word distributions are stored in T-W. .
2.2</p>
      <p>2.2 TopicExplorer</p>
      <p>TopicExplorer is a web application that helps users from the humanities to work with topic models. For example, in a collaboration with the institute for Japanese studies at Martin-Luther-University, we analyzed blog posts about the Fukushima disaster. After crawling relevant blogs, each blog entry is preprocessed by computational linguistics software to extract tokens from the full text and to store them in their lemmatized forms, together with their part-of-speech tags (e.g. noun, verb or adjective) and their string positions in the text.</p>
      <p>TopicExplorer interactively visualizes the topic structure of the documents. The visualizations require joining the data about documents and words with the results from the topic model. An ERM that integrates the left and the right part of Figure 2 is obtained by merging the matching entities from both sides. It gives the application developer a good idea of how to access those data, without needing to understand the machine learning details of a topic model.</p>
      <p>[Figure 3: The traditional and the data model driven development approaches, relating data, the inference algorithm, the domain data model, the data model of the probabilistic model, the parameter estimate, the integrated data model (A + B), the data management components and the interactive user interface across the tasks (A) to (D); the data model driven step is marked (C), application development (D).]</p>
      <p>
        We present how to derive a few visualizations that are part of the current version
of TopicExplorer [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>Document topic mixture. LDA assigns to each document a vector of topic probabilities, called a topic mixture, which is stored in D-T. It can be presented as a list of topics in decreasing order of probability.</p>
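As a sketch, such a list is a single ordered query over a hypothetical D_T table (underscore instead of hyphen for SQL); the column name Theta for the mixture probabilities is an assumption:

```python
import sqlite3

# Minimal sketch: read one document's topic mixture from D_T and list the
# topics by falling probability.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE D_T (DocID INT, TopicID INT, Theta REAL)")
conn.executemany("INSERT INTO D_T VALUES (?, ?, ?)",
                 [(1, 0, 0.1), (1, 1, 0.7), (1, 2, 0.2)])
mixture = conn.execute(
    "SELECT TopicID, Theta FROM D_T WHERE DocID = 1 ORDER BY Theta DESC"
).fetchall()
# mixture -> [(1, 0.7), (2, 0.2), (0, 0.1)]
```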
      <p>Topic Documents. Reversing the idea behind the document topic mixture, one can visualize a topic as a list of the most representative documents for this topic. This is done by joining Document, Token and Topic, grouping by the IDs of both documents and topics and counting the number of tokens in each group. For each topic, the entries are sorted by decreasing token count, yielding a list of representative documents.</p>
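The join-group-count-sort steps above can be sketched as follows; the schema (tables Document and Token with the topic assignment stored on Token) is an assumption following the ERM of Figure 2:

```python
import sqlite3

# Sketch of the "topic documents" query: join Document and Token, group by
# document and topic, count tokens, and rank documents per topic by count.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Document (DocID INT, URL TEXT)")
conn.execute("CREATE TABLE Token (TokenID INT, DocID INT, TopicID INT)")
conn.executemany("INSERT INTO Document VALUES (?, ?)",
                 [(1, "blog/a"), (2, "blog/b")])
conn.executemany("INSERT INTO Token VALUES (?, ?, ?)",
                 [(1, 1, 0), (2, 1, 0), (3, 1, 1), (4, 2, 0)])
top_docs = conn.execute("""
    SELECT t.TopicID, d.DocID, COUNT(*) AS n_tokens
    FROM Document d JOIN Token t ON t.DocID = d.DocID
    GROUP BY t.TopicID, d.DocID
    ORDER BY t.TopicID, n_tokens DESC
""").fetchall()
# For topic 0, document 1 (2 tokens) ranks above document 2 (1 token).
```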
      <p>Topic words. As stated above, a topic can be represented as a list of words sorted by decreasing probability (T-W). Furthermore, the topics appear linearly ordered by similarity in the user interface to allow browsing in a semantically uninterrupted way. Computing all pairwise similarities between topics is well supported by a relational database using the table T-W.</p>
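A pairwise similarity can be computed directly on the per-topic word probabilities in T-W. The paper does not fix a similarity measure; cosine similarity is used below purely as an illustration, and the table and column names are assumptions:

```python
import math
import sqlite3

# Hypothetical T_W table holding the per-topic word probabilities phi.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T_W (TopicID INT, WordID INT, Phi REAL)")
conn.executemany("INSERT INTO T_W VALUES (?, ?, ?)",
                 [(0, 0, 0.8), (0, 1, 0.2), (1, 0, 0.5), (1, 1, 0.5)])

def topic_vector(conn, topic_id):
    """Word distribution of one topic as a {WordID: Phi} mapping."""
    return dict(conn.execute(
        "SELECT WordID, Phi FROM T_W WHERE TopicID = ?", (topic_id,)))

def cosine(a, b):
    """Cosine similarity of two sparse word-probability vectors."""
    dot = sum(p * b.get(w, 0.0) for w, p in a.items())
    na = math.sqrt(sum(p * p for p in a.values()))
    nb = math.sqrt(sum(p * p for p in b.values()))
    return dot / (na * nb)

sim = cosine(topic_vector(conn, 0), topic_vector(conn, 1))
```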
      <p>Topic frames. Another visualization of topics uses the concept of frames. A topic frame consists of a noun and a verb that are assigned to the same topic and appear close together in the same documents. Topic frames can be computed using Token, Word and Topic, grouping by topic ID and the word IDs of the frame tokens, and counting the number of frames using the same words.</p>
      <p>Topic time. To analyze how discussions in blogs evolve, TopicExplorer allows visualizing the number of tokens assigned to topics over time. This analysis is also directly supported by joining documents with topics, grouping by date and topic ID and then counting the tokens.</p>
    </sec>
    <sec id="sec-3">
      <title>Conceptual Modeling Framework</title>
      <p>Based on the method for translating probabilistic models to ERMs and on our experiences with the development of the TopicExplorer system, we propose a first idea for a new data model driven development approach dedicated to big data analytics applications. Figure 3 visualizes the traditional and our proposed data model driven development approaches. Both address four different tasks, namely (A) gather the data sources and make them available to a probabilistic model, (B) run machine learning components, (C) integrate the data sources with the machine learning output and (D) build the application consisting of data management components and an interactive user interface.</p>
      <p>
        The traditional approach addresses the tasks mainly in sequential order. The first three steps implement a data mining process following the CRISP model [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], while the last step addresses standard application development.
      </p>
      <p>
        The translation method [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ] from BN to ERM allows an alternative, data model driven approach. We assume that for a wide spectrum of machine learning problems abstract, readily developed BNs already exist. Those BNs could be pre-translated to ERMs to build a library. Thus, conceptual information about the machine learning component is already available when integrating the data sources, task (C). The BN could be treated as just another data source. The entities corresponding to observed variables in the BN, including their respective relationships, have to be matched with those from other available data sources. The matching conceptually defines the interface between data sources and machine learning, task (A). Furthermore, translating the integrated ERM into a (relational) model for a big data framework, e.g. Flink [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] or Spark [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], conceptually defines the API between the output of machine learning and the rest of the application, task (B). Depending on the framework, the tasks (A) and (B) could be supported by the generation of efficient code for interfaces to access the given data as well as machine learning implementations.
      </p>
      <p>As a consequence, application developers just need knowledge about the inputs and outputs, and the relationships among the variables in the Bayesian model, but not about probability distributions and dependencies. Thus, our new data model driven approach eliminates unnecessary complexity caused by the lack of compatible conceptual languages on the two sides of machine learning and data management. Thereby, it makes the collaboration between both sides more direct and offers potential for optimization.</p>
    </sec>
    <sec id="sec-4">
      <title>Related Work</title>
      <p>
        There are several approaches that combine data management with machine learning; however, none of them reaches a conceptual level comparable to ERMs. Hazy [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] provides programming, infrastructure and statistical processing abstractions, the latter based on Markov logic [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This requires a deeper understanding of the machine learning algorithms in order to combine them effectively with data management. Several approaches combine machine learning APIs with SQL [1-3, 9] or with their own declarative language [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>
        Last, data management is combined with machine learning at the level of user interfaces. A recent example is scikit-learn [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], which enables users to quickly select data sources and try different algorithms. Such tools do not offer an easy way to integrate machine learning results with domain-specific meta data. Our approach also contrasts with statistical programming languages and software like R and SAS that just offer programming APIs to data sources and machine learning algorithms.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>Knowledge about the machine learning side of a project helps developers to build a big data analytics application. The proposed framework gives guidelines for how to effectively build an integrated conceptual model that includes details about the domain-specific aspects as well as the machine learning side of a big data analytics application. Future work includes the implementation of the framework and optimizing efficiency when translating integrated conceptual models to a particular implementation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Akdere, Cetintemel, Riondato, et al. The case for predictive database systems: Opportunities and challenges. In CIDR, pp. 167-174, 2011.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Alexandrov, Bergmann, Ewen, et al. The Stratosphere platform for big data analytics. The VLDB Journal, 23(6):939-964, 2014.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Armbrust, Xin, Lian, et al. Spark SQL: Relational data processing in Spark. In SIGMOD, pp. 1383-1394, 2015.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Blei, Ng, and Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993-1022, 2003.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Chen. The entity-relationship model: toward a unified view of data. ACM Transactions on Database Systems (TODS), 1(1):9-36, 1976.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Domingos and Richardson. Markov logic: A unifying framework for statistical relational learning. Introduction to Statistical Relational Learning, pp. 339-371, 2007.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Han, Kamber, and Pei. Data Mining: Concepts and Techniques. MK Pub., 2011.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. Heckerman, Meek, and Koller. Probabilistic entity-relationship models, PRMs, and plate models. Introduction to Statistical Relational Learning, pp. 201-238, 2007.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. Hellerstein, Re, Schoppmann, et al. The MADlib analytics library: or MAD skills, the SQL. VLDB, 5(12):1700-1711, 2012.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. Hinneburg, Oberlander, Rosner, et al. Exploring document collections with topic frames. In CIKM, pp. 2084-2086, 2014.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>11. Hinneburg, Preiss, and Schroder. TopicExplorer: Exploring document collections with topic models. In PKDD, Part II, pp. 838-841, 2012.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>12. Kumar, Niu, and Re. Hazy: Making it easier to build and maintain big-data analytics. Communications of the ACM, 56(3):40-49, 2013.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>13. Rosner and Hinneburg. Translating Bayesian networks into entity relationship models. In 35th Int. Conf. on Conceptual Modeling, ER, 2016. To appear.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>14. Rosner and Hinneburg. Translating Bayesian Networks into Entity Relationship Models, Extended Version. ArXiv e-prints, 1607.02399, July 2016.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>15. scikit-learn: Machine Learning in Python, 2014.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>16. Sparks, Talwalkar, Smith, et al. MLI: An API for distributed machine learning. In ICDM, pp. 1187-1192, 2013.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>