      An Approach to Data Mining Inside PostgreSQL Based
              on Parallel Implementation of UDFs

                 © Timofey Rechkalov                             © Mikhail Zymbler
                                    South Ural State University,
                                       Chelyabinsk, Russia
                           trechkalov@yandex.ru          mzym@susu.ru
      Abstract. Relational DBMSs remain the most popular tool for data processing. However, most
stand-alone data mining packages process flat files outside a DBMS. In-database data mining avoids the
export-import bottleneck of stand-alone mining packages and keeps all the benefits provided by the
DBMS. This paper describes an approach to data mining inside PostgreSQL based on parallel
implementation of user-defined functions (UDFs) for modern Intel many-core platforms. A UDF
performs a single mining task on data from the specified table and produces a resulting table. The UDF is
organized as a wrapper of an appropriate mining algorithm, which is implemented in C and
parallelized by means of OpenMP and thread-level parallelism. The library of such UDFs maintains
a cache of precomputed mining structures to reduce the cost of computations. We compare the performance of our
approach with the R data mining package, and experiments show the efficiency of the proposed approach.
      Keywords: data mining, in-database analytics, PostgreSQL, thread-level parallelism, OpenMP.

1 Introduction

Currently, relational DBMSs remain the most popular facility for storing, updating, and querying structured data. At the same time, most data mining algorithms assume processing of flat file(s) outside a DBMS. However, exporting data sets and importing mining results impede the analysis of large databases outside a DBMS [18]. In addition to avoiding the export-import bottleneck, an approach to data mining inside a DBMS provides many benefits for the end user, such as query optimization, data consistency, and security.

    Existing approaches to integrating data mining with relational DBMSs include special data mining languages and SQL extensions, implementation of mining algorithms in plain SQL, and user-defined functions (UDFs) implemented in a high-level language like C++. The latter approach lends itself to parallel processing on modern many-core platforms.

    In this paper, we present an approach to data mining inside the PostgreSQL open-source DBMS that exploits the capabilities of the modern Intel MIC (Many Integrated Core) [2] platform. Our approach assumes a library of UDFs, each of which performs a single mining task on data from a specified table and produces a resulting table. A UDF is organized as a wrapper of an appropriate mining algorithm, which is implemented in C and parallelized for the Intel MIC platform by means of OpenMP and thread-level parallelism.

    The paper is structured as follows. We describe the proposed approach in Section 2. The results of experimental evaluation of our approach are given in Section 3. Section 4 briefly discusses related work. Section 5 contains summarizing comments and directions for future research.

Proceedings of the XIX International Conference "Data Analytics and Management in Data Intensive Domains" (DAMDID/RCDL'2017), Moscow, Russia, October 10–13, 2017

2 Embedding of data mining functions into PostgreSQL

2.1 Motivation example

Our approach aims to provide a database application programmer with a library of data mining functions that can be run inside the DBMS, as shown in Fig. 1.

Figure 1 An example of using a data mining function inside PostgreSQL

    In this example, the mining function performs clustering by the Partitioning Around Medoids (PAM) [8] algorithm on the data points from the specified input table and saves the results in the output table (with respect to the specified number of the input table's columns, the number of clusters, and the accuracy). The application programmer is not obliged to export the data to be mined from the DBMS and import the mining results back. At the same time, here PAM encapsulates a parallel implementation [24] based on OpenMP and thread-level parallelism.
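The call shown in Fig. 1 ultimately comes down to a query that writes the mining results into the output table (cf. Section 2.3). As a rough C sketch of how such a query string could be assembled, consider the following; the UDF name pgPAM, the table names, and the parameter list are hypothetical illustrations, not the library's actual interface.

```c
#include <stdio.h>

/* Sketch of how a Frontend-style function might assemble the query that runs
 * a mining UDF inside PostgreSQL. The UDF name (pgPAM) and the parameter
 * list are hypothetical stand-ins, not the actual pgMining API. */
int build_mining_query(char *buf, size_t len,
                       const char *in_table, const char *out_table,
                       int n_cols, int n_clusters, double eps) {
    /* Returns the length of the formatted query, as snprintf does. */
    return snprintf(buf, len,
                    "INSERT INTO %s SELECT * FROM pgPAM('%s', %d, %d, %g);",
                    out_table, in_table, n_cols, n_clusters, eps);
}
```

For instance, mining the first 3 columns of a table `points` into a table `clusters` with 10 clusters and accuracy 1e-4 would yield the string `INSERT INTO clusters SELECT * FROM pgPAM('points', 3, 10, 0.0001);`.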
2.2 Component structure

Fig. 2 depicts the component structure of our approach. The pgMining is a library of data mining functions, each of which is to be run inside PostgreSQL. The mcMining is a library that exports data mining functions, which are parallelized for modern many-core platforms and are subject to wrapping by the respective functions from the pgMining library. The implementation of the pgMining library uses PostgreSQL's SPI (Server Programming Interface), which provides low-level functions for data access.

Figure 2 Component structure of the proposed approach

    The pgMining library consists of the two following subsystems, namely Frontend and Backend, where the former provides the presentation layer and the latter the data access layer of concerns for an application programmer.

    The Frontend provides a set of functions for mining inside PostgreSQL. Each function performs a single mining task (e.g. clustering, classification, pattern search, etc.) and produces a resulting table.

    The Backend consists of two modules, namely Wrapper and Cache manager. The Wrapper provides functions that serve as envelopes for the respective mining functions from the mcMining library. The Cache manager maintains a cache of precomputed mining structures to reduce the cost of computations.

    The mcMining library provides a set of functions that solve various data mining tasks in main memory and exploit the capabilities of Intel many-core platforms.

2.3 Frontend

An example of a Frontend function is given in Fig. 3. Such a function connects to PostgreSQL, carries out some mining task, and returns an exit code (0 in case of success, otherwise a negative error code). As a side effect, the function creates a table with the mining results. The function's mandatory parameters are the ID of the PostgreSQL connection, the name of the input table, the name of the output table, and the number of the first (leftmost) columns of the input table containing the data to be mined. The rest of the parameters are specific to the task (e.g. number of clusters, accuracy, etc.).

Figure 3 Interface and implementation schema of a function from Frontend

    In fact, a Frontend function wraps the respective UDF from the Backend, which is loaded into PostgreSQL and executed as an "INSERT INTO … SELECT …" query to save the mining results in the specified table.

2.4 Backend

Fig. 4 depicts an example of a Wrapper function. Such a function is a UDF, which wraps a parallelized mining function from mcMining and operates as follows. Firstly, the function parses its input to form the parameters with which to call the mcMining function. After that, the function checks whether the input table and/or auxiliary mining structures are in the cache maintained by the Cache manager, and loads them if not. Finally, the mcMining function is called with the appropriate parameters.

Figure 4 Interface and implementation schema of a function from Backend
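The Wrapper control flow of Section 2.4 — parse the parameters, consult the cache, then invoke the parallel mining function — can be sketched in C as follows. All identifiers here (pam_wrapper, cache_get, cache_put, mc_pam) are hypothetical stand-ins for the pgMining/mcMining internals, and the one-slot cache is a deliberate simplification.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One-slot stand-in for the Cache manager (simplification for the sketch). */
static double *cached_matrix = NULL;
static char cached_id[64] = "";

static double *cache_get(const char *id) {
    return strcmp(cached_id, id) == 0 ? cached_matrix : NULL;
}

static void cache_put(const char *id, double *m) {
    free(cached_matrix);
    cached_matrix = m;
    snprintf(cached_id, sizeof cached_id, "%s", id);
}

/* Distance matrix for 1-D points (absolute differences), row-major n×n. */
static double *compute_dist_matrix(const double *data, int n) {
    double *m = malloc((size_t)n * n * sizeof *m);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double d = data[i] - data[j];
            m[i * n + j] = d < 0 ? -d : d;
        }
    return m;
}

/* Stub for the parallelized mining call; returns 0 on success. */
static int mc_pam(const double *dist, int n, int k) {
    (void)dist; (void)n; (void)k;
    return 0;
}

/* Wrapper flow: build the structure ID, hit the cache or compute and store
 * the structure, then call the mining function. *hit reports a cache hit. */
int pam_wrapper(const char *table, const double *data, int n, int k, int *hit) {
    char id[64];
    snprintf(id, sizeof id, "%s_distMatrix", table);  /* ID scheme of Sec. 2.4 */
    double *dist = cache_get(id);
    *hit = (dist != NULL);
    if (!dist) {
        dist = compute_dist_matrix(data, n);
        cache_put(id, dist);
    }
    return mc_pam(dist, n, k);
}
```

A second call of pam_wrapper on the same table then finds the distance matrix already in the cache and skips the computation, which is exactly the saving the Cache manager is designed for.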
    The Cache manager provides a buffer pool to store precomputed mining structures. A distance matrix is a typical example of a mining structure to be saved in the cache. Indeed, the distance matrix A = (a_ij) stores the distance between each pair of elements a_i and a_j of the input data set. Being precomputed once, the distance matrix can be used many times to perform clustering or kNN-based classification with various parameters (e.g. number of clusters, number of neighbors, accuracy, etc.).

Figure 5 Interface of the Cache manager module

    The Cache manager exports the two basic functions depicted in Fig. 5. The putObject function loads a mining structure, specified by its ID, buffer pointer, and size, into the cache. The getObject function searches the cache for an object with the given ID. The ID of a mining structure is a string formed as the concatenation of the input table's name and the object's informational string (e.g. "_distMatrix").

Table 2 Summary of datasets used in experiments
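A minimal sketch of this putObject/getObject interface might look as follows; the fixed-size pool, the copying semantics, and the exact signatures are our assumptions for illustration, not the actual Cache manager code.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of the Cache manager interface of Fig. 5: putObject stores a mining
 * structure under a string ID, getObject looks it up. The fixed-size table
 * and exact signatures are assumptions, not the actual pgMining code. */

#define CACHE_SLOTS 16

typedef struct {
    char id[128];   /* e.g. "points_distMatrix" */
    void *buf;
    size_t size;
} cache_entry;

static cache_entry cache[CACHE_SLOTS];
static int cache_used = 0;

/* Stores a copy of the object; returns 0 on success, -1 if the pool is full. */
int putObject(const char *id, const void *buf, size_t size) {
    if (cache_used == CACHE_SLOTS)
        return -1;
    cache_entry *e = &cache[cache_used++];
    snprintf(e->id, sizeof e->id, "%s", id);
    e->buf = malloc(size);
    memcpy(e->buf, buf, size);
    e->size = size;
    return 0;
}

/* Returns a pointer to the cached object and reports its size,
 * or NULL on a cache miss. */
void *getObject(const char *id, size_t *size) {
    for (int i = 0; i < cache_used; i++)
        if (strcmp(cache[i].id, id) == 0) {
            *size = cache[i].size;
            return cache[i].buf;
        }
    return NULL;
}
```

Following the ID scheme above, a distance matrix for a table `points` would be stored under the ID `points_distMatrix`.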
2.5 Library of parallel many-core algorithms

Fig. 6 gives an example of a function from the mcMining library. Such a function encapsulates a parallel implementation based on OpenMP and thread-level parallelism for Intel many-core platforms.

Figure 6 Interface of a function from the mcMining library

    In this example, we use the Partitioning Around Medoids (PAM) [8] clustering algorithm, which is used in a wide spectrum of applications where minimal sensitivity to noisy data is required. PAM provides this property since it represents cluster centers by points of the input data set (medoids).

    PAM first calculates the distance matrix for the given data points. Then, in the BUILD phase, an initial clustering is obtained by the successive selection of medoids until the required number of clusters has been found. Next, in the SWAP phase, the algorithm attempts to improve the clustering in accordance with an objective function. However, for large and high-dimensional datasets, PAM's computations are very costly.

    In our previous research [24], we parallelized PAM for the Intel Xeon CPU and the Intel Xeon Phi coprocessor. In order to perform best on Intel many-core platforms, the parallel version of PAM exploits loop modifications that enable vectorization of calculations, and chunk-by-chunk data processing that decreases the number of cache misses.

3 Experimental evaluation

3.1 Hardware, datasets and goals of experiments

To evaluate the developed approach, we performed experiments on the Tornado SUSU supercomputer [9], each node of which provides two different platforms, namely an Intel Xeon CPU and an Intel Xeon Phi coprocessor (cf. Tab. 1 for the specifications).

Table 1 Specifications of hardware

  Specifications              CPU      Coprocessor
  Model, Intel Xeon           X5680    Phi SE10X
  Cores                       2×6      61
  Frequency, GHz              3.33     1.1
  Threads per core            2        4
  Peak performance, TFLOPS    0.371    1.076
  Memory, Gb                  24       18
  Cache, Mb                   12       30.5

    In the experiments, we used datasets with the characteristics depicted in Tab. 2.

Table 2 Summary of datasets used in experiments

  Dataset                   dimension    # clusters    # data points, ×2^10
  FCS Human [3]             423          10            18
  MixSim [13]               5            10            35
  US Census [12]            67           10            35
  Power Consumption [10]    3            10            35

    In the experiments, we studied the following aspects of the developed approach. Firstly, we investigated the speedup of the mcPAM function to understand its scalability on both platforms depending on the number of threads employed. Secondly, we evaluated the runtime of the mcPAM function to understand how the performance on both platforms depends on the number of data points, and what benefits we could derive from precomputing the distance matrix. Finally, we compared the performance of the pgPAM function with the implementation of the PAM algorithm from the R data mining package [13].

3.2 Results of experiments

The results of the first series of experiments, on mcPAM speedup, are depicted in Fig. 7. On both platforms, mcPAM's speedup is close to linear when the number of threads matches the number of physical cores the algorithm is running on (i.e. 12 cores for Intel Xeon and 60 cores for Intel Xeon Phi, respectively). Speedup becomes sub-linear when the algorithm uses more than one thread per physical core. The mcPAM achieves up to 15× and 120× speedup on Intel Xeon and Intel Xeon Phi, respectively. Summing up, mcPAM demonstrates good scalability on both platforms.
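A large share of PAM's runtime goes to computing the distance matrix (cf. Section 2.5). A minimal sketch of its thread-level parallelization with OpenMP is given below; for simplicity, the sketch stores squared Euclidean distances and omits the chunk-by-chunk processing of [24], so it is an illustration rather than the actual mcMining code.

```c
#include <stdlib.h>

/* Minimal sketch (not the actual mcMining code) of the loop that dominates
 * PAM's runtime: computing the n×n matrix of pairwise squared Euclidean
 * distances for n points of dimension dim. The outer loop is distributed
 * across threads by OpenMP; the inner loop over coordinates has unit-stride
 * accesses amenable to compiler vectorization. */
double *sq_dist_matrix(const double *data, int n, int dim) {
    double *m = malloc((size_t)n * n * sizeof *m);
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int d = 0; d < dim; d++) {
                double diff = data[i * dim + d] - data[j * dim + d];
                s += diff * diff;
            }
            m[i * n + j] = s;   /* squared distance; take sqrt if needed */
        }
    return m;
}
```

Since iterations of the outer loop are independent, the pragma is all that is required for thread-level parallelism; without OpenMP the pragma is ignored and the code runs serially.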




Figure 7 Speedup of the mcPAM function: (a) Intel Xeon CPU; (b) Intel Xeon Phi coprocessor

Figure 8 Performance of the mcPAM function: (a) FCS Human dataset; (b) MixSim dataset; (c) US Census dataset; (d) Power Consumption dataset
Figure 9 Performance of the pgPAM function (on Intel Xeon): (a) FCS Human dataset; (b) MixSim dataset; (c) US Census dataset; (d) Power Consumption dataset

    Fig. 8 shows the results of the second series of experiments, on mcPAM performance. As can be seen, PAM's SWAP phase performs better on Intel Xeon, while the BUILD phase performance is equal on both platforms.

    Overall performance is better on Intel Xeon Phi than on Intel Xeon when the algorithm deals with a high-dimensional dataset, due to the possibility of intensive vectorization in the calculation of the distance matrix. Since the calculation of the distance matrix takes from 15 to 80 percent of the overall runtime, we can derive substantial benefits from caching the distance matrix.

    The results of the third series of experiments, comparing the performance of pgPAM and PAM from the R data mining package, are illustrated in Fig. 9. We carried out this series of experiments on the Intel Xeon platform only, for the following reason. Running PostgreSQL on the Intel MIC platform demands the Intel Xeon Phi Knights Landing (KNL), which is the next-generation product from Intel and is a bootable device. However, Intel Xeon Phi KNL is not yet available at the Tornado SUSU supercomputer. We plan to perform this study as further research.

    We can see that pgPAM significantly overtakes R's PAM in both cases, whether one thread or the maximum number of threads is employed. Caching the distance matrix improves the performance by up to 80 percent of the overall runtime (in the case of a high-dimensional dataset).

4 Related work

The problem of integrating data analytics with relational DBMSs has been studied since the origins of data mining research.

    Data mining query languages include DMQL [5], MSQL [7], the MINE RULE operator [14], and Microsoft's DMX [28].

    There are many SQL implementations of data mining algorithms. SQL versions of classical clustering algorithms include K-Means [16], EM [19], and Fuzzy C-Means [15]. SQL versions of association rule mining algorithms include K-Way-Join, Three-Way-Join,
Subquery and Two-Group-Bys [25], Set-oriented Apri-                5 Conclusion
ori [29], Quiver [23], Propad [27]. Classification in-
cludes SQL implementations of decision trees [26],                 In this paper, we touch upon the problem of organizing
kNN [31] and Bayesian classification [21]. SQL is also             data mining inside a DBMS. We present an approach to
successfully used in mining applications for data with             implementation of in-database analytical functions for
“non-relational” nature as graphs, for instance in search          PostgreSQL that exploits capabilities of modern Intel
for frequent graphs [4], detection of cycles in graph [1],         many-core platforms.
graph partitioning [11, 22], etc.                                      Our approach supposes implementation of two li-
    User-defined functions-based approach. Integration             braries, namely pgMining and mcMining. The
of correlation, linear regression, PCA and clustering              pgMiningis a library of data mining functions each one
into the Teradata DBMS based on UDFs is proposed                   of them is to be run inside PostgreSQL. The mcMining
in [17]. There are two sets of UDFs that work in a sin-            is a library that exports functions to solve various data
gle table scan, that is an aggregate UDF to compute                mining tasks, which are parallelized for Intel MIC plat-
summary matrices and a set of scalar UDFs to score                 forms.
data sets. Experiments showed that UDFs are faster than                The pgMining consists of Frontend and Backend
SQL queries and UDFs are more efficient than C++,                  subsystems. The Frontend's function loads an UDF
due to long export times. In [20] UDFs implementing                from the Backend into PostgreSQL and executes it as
common vector operations were presented and it was                 “INSERT INTO … SELECT …” query to save mining re-
shown that UDFs are as efficient as automatically gen-             sults in a table. The Backend consists of Wrapper and
erated SQL queries with arithmetic expressions and                 Cache manager modules. The Wrapper provides func-
queries calling scalar UDFs are significantly more effi-           tions that serve as envelopes for the respective mcMin-
cient than equivalent queries using SQL aggregations.              ing mining functions. The Cache manager supports
    In-database mining frameworks. The ATLAS [30] is               cache of precomputed mining structures to reduce costs
a framework for in-database analytics, which provides              of computations.
SQL-like database language with user-defined aggre-                    Since our approach assumes hiding details of paral-
gates (UDAs) and table functions. The system's lan-                lel implementation from PostgreSQL, such an approach
guage processor translates ATLAS programs into C++                 could be ported to some other open-source DBMS (with
code, which is then compiled and linked with the data-             possible non-trivial but mechanical software develop-
base storage manager and user-defined external func-               ment effort).
tions. Authors presented ATLAS-based implementa-                       We have evaluated our approach on previously im-
tions of several data mining algorithms.                           plemented parallel clustering algorithm of mcMining
    The MADlib [6] is an open source library of in-                library and four real datasets. Experiments showed good
database analytical algorithms for PostgreSQL. The                 speedup and performance of the algorithm as well as
MADlib is implemented by a big team and provides                   our approach derive benefits from caching of precom-
many methods for supervised learning, unsupervised                 puted mining structures and overtakes R data mining
learning and descriptive statistics. The MADlib exploits           package.
UDAs, UDFs, and a sparse matrix C library to provide                   As future work, we plan to implement other mining
efficient representations on disk and in memory. As                algorithms formcMining library and conduct experi-
many statistical methods are iterative (i.e. they make             ments on Intel Xeon Phi Knights Landing platform.
many passes over a data set), authors wrote a driver
UDF in Python to control iteration in such a way that all
                                                                   Acknowledgement
large data movement is done within the database engine
and its buffer pool.                                               This work was financially supported by the Russian
    Comparison. In this paper, we suggest an approach              Foundation for BasicResearch (grant No. 17-07-00463),
to embedding data mining functions into PostgreSQL.                by Act 211 Government of the Russian Federation (con-
As some methods mentioned above our approach ex-                   tract No. 02.A03.21.0011) and by the Ministry of edu-
ploits UDFs. The difference from the previous works                cation and science of Russian Federation (government
includes the following. Our approach supposes parallel-            order 2.7905.2017/8.9).
ization of UDFs for many-core platform that current
DBMS is running on. All the parallelization details are            References
encapsulated in implementation of the UDF and are
                                                                    [1] Balachandran, R., Padmanabhan, S., Chakra-
hided from the DBMS, so our approach could be ported
                                                                        varthy, S.: Enhanced DB-Subdue: Supporting Sub-
to some other open-source DBMS (with possible non-
                                                                        tle Aspects of Graph Mining Using a Relational
trivial but mechanical software development effort). In
                                                                        Approach. In: W.K. Ng, M. Kitsuregawa, J. Li, K.
addition, our approach supposes a special module,
                                                                        Chang (eds.) Advances in Knowledge Discovery
which provides a cache of precomputed mining struc-
                                                                        and Data Mining, 10th Pacific-Asia Conf.,
tures and lets UDF know to reuse these structures to
                                                                        PAKDD 2006, Singapore, April 9–12, 2006, Proc.,
reduce costs of computations.
                                                                        Lecture Notes in Computer Science, 3918,




                                                             119
     pp. 673-678. Springer (2006). doi:10.1007/                  [13] Melnykov, V., Chen, W.C., Maitra, R.: Mixsim:
     11731139 77                                                      An R Package for Simulating Data to Study Per-
 [2] Duran, A., Klemm, M.: The Intel Many Integrated                  formance of Clustering Algorithms. J. of Statisti-
     Core Architecture. In: W.W. Smari, V. Zeljkovic                  cal Software, Articles 51 (12), pp. 1-25 (2012).
     (eds.) HPCS, pp. 365-366. IEEE (2012)                            doi:10.18637/jss.v051.i12
 [3] Engreitz, J.M., Jr., B.J.D., Marshall, J.J., Alt-           [14] Meo, R., Psaila, G., Ceri, S.: A New SQL-like Op-
     man, R.B.: Independent Component Analysis:                       erator for Mining Association Rules. In:
     Mining Microarray Data for Fundamental Human                     T.M. Vijayaraman, A.P. Buchmann, C. Mohan,
     Gene Expression Modules. J. of Biomedical In-                    N.L. Sarda (eds.) VLDB'96, Proc. of 22th Int.
     formatics. 43 (6), pp. 932-944 (2010)                            Conf. on Very Large Data Bases, September 3–6,
 [4] Garcia, W., Ordonez, C., Zhao, K., Chen, P.: Effi-               1996, Mumbai (Bombay), India, pp. 122-133.
     cient Algorithms Based on Relational Queries to                  Morgan Kaufmann (1996)
     Mine Frequent Graphs. In: A. Nica, A.S. Varde               [15] Miniakhmetov, R., Zymbler, M.: Integration of
     (eds.) Proc. of the Third Ph.D. Workshop on In-                  Fuzzy c-means Clustering Algorithm with Post-
     formation and Knowledge Management, PIKM                         greSQL Database Management System. Numeri-
     2010, Toronto, Ontario, Canada, October 30,                      cal Methods and Programming 13 (2(26)), pp. 46-
     2010, pp. 17-24. ACM (2010). doi:10.1145/                        52 (2012) (in Russian)
     1871902.1871906                                             [16] Ordonez, C.: Integrating k-means Clustering with
 [5] Han, J., Fu, Y., Wang, W., Chiang, J., Gong, W.,                 a Relational DBMS Using SQL. IEEE Trans.
     Koperski, K., Li, D., Lu, Y., Rajan, A., Stefanov-               Knowl. Data Eng. 18 (2), pp. 188-201 (2006).
     ic, N., Xia, B., Zaiane, O.R.: Dbminer: A System                 doi:10.1109/TKDE.2006.31
     for Mining Knowledge in Large Relational Data-              [17] Ordonez, C.: Building Statistical Models and Scor-
     bases. In: E. Simoudis, J. Han, U.M. Fayyad (eds.)               ing with UDFs. In: C.Y. Chan, B.C. Ooi, A. Zhou
     Proc. of the Second Int. Conf. on Knowledge Dis-                 (eds.) Proc. of the ACM SIGMOD Int. Conf. on
     covery and Data Mining (KDD-96), Portland, Or-                   Management of Data, Beijing, China, June 12–14,
     egon, USA, pp. 250-255. AAAI Press (1996)                        2007,     pp.     1005-1016.      ACM      (2007).
 [6] Hellerstein, J.M., Re, C., Schoppmann, F.,                       doi:10.1145/1247480.1247599
     Wang, D.Z., Fratkin, E., Gorajek, A., Ng, K.S.,             [18] Ordonez, C.: Statistical Model Computation with
     Welton, C., Feng, X., Li, K., Kumar, A.: The                     UDFs. IEEE Trans. Knowl. Data Eng. 22 (12),
     MADlib Analytics Library or MAD Skills, the                      pp. 1752-1765                              (2010).
     SQL. PVLDB 5(12), pp. 1700-1711 (2012)                           doi:10.1109/TKDE.2010.44
 [7] Imielinski, T., Virmani, A.: MSQL: A Query Lan-             [19] Ordonez, C., Cereghini, P.: SQLEM: Fast Cluster-
     guage for Database Mining. Data Min. Knowl.                      ing in SQL Using the EM Algorithm. In: W. Chen,
     Discov. 3 (4), pp. 373-408 (1999). doi:                          J.F. Naughton, P.A. Bernstein (eds.) Proc. of the
     10.1023/A:1009816913055                                          2000 ACM SIGMOD Int. Conf. on Management
 [8] Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley (1990)
 [9] Kostenetskiy, P., Safonov, A.: SUSU Supercomputer Resources. In: L. Sokolinsky, I. Starodubov (eds.) PCT'2016, Int. Scientific Conf. on Parallel Computational Technologies, Arkhangelsk, Russia, March 29–31, 2016, pp. 561-573. CEUR Workshop Proceedings, 1576 (2016)
[10] Lichman, M.: UCI Machine Learning Repository [http://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption]. Irvine, CA: University of California, School of Information and Computer Science (2013)
[11] McCaffrey, J.D.: A Hybrid System for Analyzing Very Large Graphs. In: S. Latifi (ed.) Ninth Int. Conf. on Information Technology: New Generations, ITNG 2012, Las Vegas, Nevada, USA, 16–18 April, 2012, pp. 253-257. IEEE Computer Society (2012). doi:10.1109/ITNG.2012.43
[12] Meek, C., Thiesson, B., Heckerman, D.: The Learning-curve Sampling Method Applied to Model-based Clustering. J. of Machine Learning Research, 2, pp. 397-418 (2002)
     of Data, May 16–18, 2000, Dallas, Texas, USA, pp. 559-570. ACM (2000). doi:10.1145/342009.335468
[20] Ordonez, C., Garcia-Garcia, J.: Vector and Matrix Operations Programmed with UDFs in a Relational DBMS. In: P.S. Yu, V.J. Tsotras, E.A. Fox, B. Liu (eds.) Proc. of the 2006 ACM CIKM Int. Conf. on Information and Knowledge Management, Arlington, Virginia, USA, November 6–11, 2006, pp. 503-512. ACM (2006). doi:10.1145/1183614.1183687
[21] Ordonez, C., Pitchaimalai, S.K.: Bayesian Classifiers Programmed in SQL. IEEE Trans. Knowl. Data Eng. 22 (1), pp. 139-144 (2010). doi:10.1109/TKDE.2009.127
[22] Pan, C., Zymbler, M.: Very Large Graph Partitioning by Means of Parallel DBMS. In: B. Catania, G. Guerrini, J. Pokorny (eds.) Advances in Databases and Information Systems – 17th East European Conf., ADBIS 2013, Genoa, Italy, September 1–4, 2013. Proc., Lecture Notes in Computer Science, 8133, pp. 388-399. Springer (2013). doi:10.1007/978-3-642-40683-6_29




[23] Rantzau, R.: Frequent Itemset Discovery with SQL Using Universal Quantification. In: R. Meo, P.L. Lanzi, M. Klemettinen (eds.) Database Support for Data Mining Applications: Discovering Knowledge with Inductive Queries, Lecture Notes in Computer Science, 2682, pp. 194-213. Springer (2004). doi:10.1007/978-3-540-44497-8_10
[24] Rechkalov, T., Zymbler, M.: Accelerating Medoids-based Clustering with the Intel Many Integrated Core Architecture. In: 9th Int. Conf. on Application of Information and Communication Technologies, AICT 2015, October 14–16, 2015, Rostov-on-Don, Russia. Proceedings, pp. 413-417. IEEE (2015). doi:10.1109/ICAICT.2015.7338591
[25] Sarawagi, S., Thomas, S., Agrawal, R.: Integrating Association Rule Mining with Relational Database Systems: Alternatives and Implications. Data Min. Knowl. Discov. 4 (2/3), pp. 89-125 (2000). doi:10.1023/A:1009887712954
[26] Sattler, K., Dunemann, O.: SQL Database Primitives for Decision Tree Classifiers. In: Proc. of the 2001 ACM CIKM Int. Conf. on Information and Knowledge Management, Atlanta, Georgia, USA, November 5–10, 2001, pp. 379-386. ACM (2001). doi:10.1145/502585.502650
[27] Shang, X., Sattler, K., Geist, I.: SQL Based Frequent Pattern Mining with FPGrowth. In: D. Seipel, M. Hanus, U. Geske, O. Bartenstein (eds.) Applications of Declarative Programming and Knowledge Management, 15th Int. Conf. on Applications of Declarative Programming and Knowledge Management, INAP 2004, and 18th Workshop on Logic Programming, WLP 2004, Potsdam, Germany, March 4–6, 2004, Revised Selected Papers, Lecture Notes in Computer Science, 3392, pp. 32-46. Springer (2004). doi:10.1007/11415763_3
[28] Tang, Z., Maclennan, J., Kim, P.P.: Building Data Mining Solutions with OLE DB for DM and XML for Analysis. SIGMOD Record, 34 (2), pp. 80-85 (2005). doi:10.1145/1083784.1083805
[29] Thomas, S., Chakravarthy, S.: Performance Evaluation and Optimization of Join Queries for Association Rule Mining. In: M.K. Mohania, A.M. Tjoa (eds.) Data Warehousing and Knowledge Discovery, First Int. Conf., DaWaK '99, Florence, Italy, August 30 – September 1, 1999, Proc., Lecture Notes in Computer Science, 1676, pp. 241-250. Springer (1999). doi:10.1007/3-540-48298-9_26
[30] Wang, H., Zaniolo, C., Luo, C.: ATLAS: A Small but Complete SQL Extension for Data Mining and Data Streams. In: VLDB, pp. 1113-1116 (2003)
[31] Yao, B., Li, F., Kumar, P.: K Nearest Neighbor Queries and kNN-joins in Large Relational Databases (almost) for Free. In: F. Li, M.M. Moro, S. Ghandeharizadeh, J.R. Haritsa, G. Weikum, M.J. Carey, F. Casati, E.Y. Chang, I. Manolescu, S. Mehrotra, U. Dayal, V.J. Tsotras (eds.) Proc. of the 26th Int. Conf. on Data Engineering, ICDE 2010, March 1–6, 2010, Long Beach, California, USA, pp. 4-15. IEEE Computer Society (2010). doi:10.1109/ICDE.2010.5447837



