<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Text Mining Applied to SQL Queries: A Case Study for the SDSS SkyServer</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>M. Jordan Raddick Physics and Astronomy Dept. The Johns Hopkins University Baltimore</institution>
          ,
          <addr-line>Maryland</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Rafael D. C. Santos LAC - INPE São José dos Campos São Paulo -</institution>
          <country country="BR">Brazil</country>
        </aff>
      </contrib-group>
      <fpage>66</fpage>
      <lpage>72</lpage>
      <abstract>
        <p>SkyServer, the portal for the Sloan Digital Sky Survey (SDSS) catalog, provides data access tools for astronomers and scientific education. One of the interfaces allows users to enter ad hoc SQL statements to query the catalog, and has logged over 280 million queries since 2001. This paper describes text mining techniques and preliminary results on mining the logs of the SQL queries submitted to SkyServer, along with other applications we foresee for such a procedure.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>With the increase in data collection and
generation, datasets are growing at an exponential pace,
making it a real challenge to make all the data being
produced available. As a solution, some
large scientific datasets have been made available
through publicly accessible RDBMSes (Relational
Database Management Systems), in which
scientists and interested users can query and analyze
only the most relevant and up-to-date data for their
needs.</p>
      <p>The Sloan Digital Sky Survey (SDSS) is one such case. It
makes available the largest astronomy survey to
date through SkyServer (http://skyserver.sdss3.org), its Internet portal that
allows users and astronomers to query the database
and even perform data mining tasks using SQL
(Structured Query Language), the de facto standard
for querying relational databases. The portal, in
operation since 2001, has proven to be extremely
popular, with over 1.5 billion page hits and almost 280
million SQL queries submitted.</p>
      <p>Since 2003, SkyServer has been logging every
query submitted to the portal. It collects access
information, such as timestamp, user IP address,
the tool used to submit the query, and the target
data release (DR1, DR2, etc.); and query
information, e.g., the SQL statement, query success
or failure and error message, number of rows
returned, and elapsed time. These data can be used to
generate summarized access statistics, like queries per
month or the distribution of queries per data release over time,
as presented by Raddick et al. (2014). For a
more in-depth usage analysis, however, the data have to be
processed and transformed, as done by Zhang et al. (2012),
who color-code SQL queries for visual
analysis and also present a visual sky map of popularly
searched areas.</p>
      <p>To further analyze such queries, this paper
applies text mining techniques with the goal of
defining a procedure to parse, clean and tokenize
statements into a weighted numerical
representation, which can then be fed into regular machine
learning algorithms for data mining.</p>
      <p>We proceed with an exploratory analysis, in which
we project part of the historical queries into a
low-dimensional representation and correlate the
results with the sample templates defined in the
SkyServer help pages, a list of predefined queries
ranging from Basic SQL, which shows simple SQL
structures, to specific examples of how to find
Stars, Galaxies or Quasars.</p>
    </sec>
    <sec id="sec-2">
      <title>Text Mining and SQL Queries</title>
      <p>
        Text mining, or Knowledge Discovery in Texts
(KDT), is an extension of the traditional
Knowledge Discovery in Databases (KDD), the
nontrivial process of identifying valid, novel, potentially
useful, and ultimately understandable patterns in
data
        <xref ref-type="bibr" rid="ref1">(Fayyad et al., 1996)</xref>
        , but targeting
unstructured or semi-structured data, such as emails,
full-text documents and markup files (e.g., HTML and XML),
instead of regular databases. It is a
multidisciplinary field involving, among others,
information retrieval and extraction, machine learning,
natural language processing, database technology
and visualization
        <xref ref-type="bibr" rid="ref11">(Tan, 1999)</xref>
        .
      </p>
      <p>SQL queries in this context can be seen as
mini-documents. Since SQL is a well-defined language, we can
leverage its structure to fine-tune and optimize the
preprocessing of queries to suit the specific cases found. For
instance, there is no need for stop word removal,
and by analyzing the token type (table name,
column, variable, expression, constant, etc.) we can
apply a different normalization or substitution to each.</p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>The methodology followed is the traditional KDD
process, comprising the following phases:
selection, preprocessing, transformation, data
mining, and interpretation/evaluation. Each phase is
briefly discussed below.</p>
      <sec id="sec-3-1">
        <title>Selection</title>
        <p>For this paper, we used a normalized version of
the raw data made available by Raddick et al.
(2014), who analyzed a 10-year span of log data
(12/2002 to 09/2012), amounting to almost 195
million records and 68 million unique queries.</p>
        <p>As a proof of concept, we restricted the queries
to those coming from the latest version of the online
SQL search tool (skyserver.sdss3.org), which only
allows SELECT statements and has a timeout of
10 minutes. The assumption was that this would yield a dataset
with less variance and complexity. The filter also
excluded queries with errors or with no rows returned,
resulting in a final dataset of 1.3 million queries.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Preprocessing and Transformation</title>
        <p>The main objective of the preprocessing phase is
to parse the text queries into a bag-of-words like
representation, but instead of just the set of tokens
present in each document, we also keep the count
of each token in that statement.</p>
        <p>As noted before, we can leverage the fact that
SQL is a structured language by using a proper
parser to add a layer of metadata on top of each
token. Knowing what kind of token we are
processing, we can apply specific actions to each token
type.</p>
        <p>Since SkyServer uses Microsoft SQL Server as
its RDBMS, we extended the readily available
.NET T-SQL parser library to build a custom one.
Besides normalizing letter case, the
custom parser also removes constants (strings and
numbers), database namespaces, and aliases;
substitutes keywords for temporary table names and for
logical and conditional operators; and qualifies each
token with its SQL group, e.g., select, from,
where, groupby, orderby. Substitutions and filters
were performed with the intention of removing
tokens that are trivial (such as database namespaces)
or too specific (such as constants, table aliases, or
arithmetic operations) and thus would contribute
little to discriminating or grouping queries
within the dataset.</p>
        <p>An example of the original statement and its
normalized version is shown in Figure 1. Figure
2 shows the final feature vector.</p>
        <preformat>SELECT p.objid, p.ra, p.dec,
       p.u, p.g, p.r, p.i, p.z,
       platex.plate, s.fiberid, s.elodiefeh
FROM photoobj p,
     dbo.fgetnearbyobjeq(1.62917, 27.6417, 30) n,
     specobj s, platex
WHERE p.objid = n.objid
  AND p.objid = s.bestobjid
  AND s.plateid = platex.plateid
  AND class = 'star'
  AND p.r &gt;= 14
  AND p.r &lt;= 22.5
  AND p.g &gt;= 15
  AND p.g &lt;= 23
  AND platex.plate = 2803</preformat>
        <p>(a) Raw SQL query.</p>
        <preformat>select objid ra dec u g r i z plate fiberid elodiefeh
from photoobj fgetnearbyobjeq specobj platex
where objid objid logic objid bestobjid logic plateid plateid logic class logic r logic r logic g logic g logic plate</preformat>
        <p>(b) Tokenized SQL.</p>
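        <p>As an illustration only, the step from a tokenized statement to a count-based bag of words can be sketched in Python as follows. The function name bag_of_words, the "group:token" feature naming and the set of clause keywords are assumptions made for this sketch; the custom parser built on the .NET T-SQL library is not reproduced here.</p>
        <preformat><![CDATA[
```python
from collections import Counter

# Clause keywords that open a new SQL group (assumed list for this sketch).
GROUPS = {"select", "from", "where", "groupby", "orderby"}


def bag_of_words(tokenized_sql):
    """Count tokens of an already normalized statement, qualified by SQL group."""
    counts = Counter()
    group = None
    for token in tokenized_sql.split():
        if token in GROUPS:
            group = token  # switch group; the keyword itself is not counted
        elif group is not None:
            counts[f"{group}:{token}"] += 1
    return counts


# Tokenized statement copied from panel (b) above.
tokenized = (
    "select objid ra dec u g r i z plate fiberid elodiefeh "
    "from photoobj fgetnearbyobjeq specobj platex "
    "where objid objid logic objid bestobjid logic plateid plateid "
    "logic class logic r logic r logic g logic g logic plate"
)
features = bag_of_words(tokenized)
```
]]></preformat>
        <p>In this sketch the same column name yields distinct features in different clauses, so objid in the select list and objid in the where clause are counted separately.</p>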
        <p>It is important to note that, since the parser is
strict, it can only process syntactically valid statements.</p>
        <p>
          Lastly, we weight tokens according to their
frequency, so that the contribution of very common and
very rare tokens to the discrimination between
queries is balanced. One
of the most popular weighting schemes is
TF*IDF (term frequency times inverse document
frequency), which assigns the largest weight to
terms that arise with high frequency in individual
documents but are, at the same time, relatively rare
in the collection as a whole
          <xref ref-type="bibr" rid="ref10">(Salton et al., 1975)</xref>
          .
        </p>
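        <p>A minimal sketch of such a weighting is given below. It assumes the common variant in which the inverse document frequency is log(N / df); the exact variant and the scaling factors used in this work are not specified here, and the function name tf_idf and the toy documents are hypothetical.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter


def tf_idf(documents):
    """Weight token counts by log(N / df), one common TF*IDF variant."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each token.
    df = Counter(token for doc in documents for token in set(doc))
    idf = {token: math.log(n_docs / count) for token, count in df.items()}
    weighted = []
    for doc in documents:
        tf = Counter(doc)
        weighted.append({token: tf[token] * idf[token] for token in tf})
    return weighted, idf


# Three toy token sets (hypothetical, not taken from the SkyServer logs).
docs = [
    ["select:objid", "from:photoobj", "where:ra", "where:ra"],
    ["select:objid", "from:specobj", "where:class"],
    ["select:objid", "from:photoobj", "where:dec"],
]
weights, idf = tf_idf(docs)
```
]]></preformat>
        <p>With this variant, a token that occurs in every document (select:objid in the toy data) receives zero weight, while a token repeated within a single document and absent elsewhere receives the largest one.</p>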
      </sec>
      <sec id="sec-3-3">
        <title>Data Mining</title>
        <p>
          From a general data analysis perspective,
clustering is the exploratory procedure that organizes a
collection of patterns into natural groupings based
on a given association measure
          <xref ref-type="bibr" rid="ref4">(Jain et al., 1999)</xref>
          .
Intuitively, patterns within a cluster are much more
similar to each other, while being as
different as possible from patterns belonging to a different
cluster.
        </p>
        <p>
          In text mining, clustering can be used to
summarize the contents of a document collection
          <xref ref-type="bibr" rid="ref7">(Larsen and Aone, 1999)</xref>
          . With this idea in mind, what
kind of summarization could be done over the
historical SQL logs, and how would such a summary
compare to the predefined templates? To answer this,
we apply the Self-Organizing Map
(SOM) algorithm.
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>Self-Organizing Maps</title>
        <p>
          Kohonen’s SOM
          <xref ref-type="bibr" rid="ref6">(Kohonen, 2001)</xref>
          is a neural
network algorithm that performs unsupervised
learning. It implements an orderly mapping of
high-dimensional data into a regular low-dimensional
grid or matrix, reducing the original data
dimension while preserving topological and metric
relationships of the data
          <xref ref-type="bibr" rid="ref5">(Kohonen, 1998)</xref>
          .
        </p>
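        <p>The SOM consists of <italic>M</italic> units located on a
regular grid. The grid is usually one- or
two-dimensional, particularly when the objective is to
use the SOM for data visualization. Each unit <italic>j</italic>
has a prototype vector <italic>m</italic><sub><italic>j</italic></sub> = [<italic>m</italic><sub><italic>j</italic>1</sub>, ..., <italic>m</italic><sub><italic>jd</italic></sub>] at a
location <italic>r</italic><sub><italic>j</italic></sub>, where <italic>d</italic> represents the dimension of a
data item. The map adjusts to the data by
adapting the values of its prototype vectors during the
training phase. At each training step <italic>t</italic>, a sample
data vector <italic>x</italic><sub><italic>i</italic></sub> = [<italic>x</italic><sub><italic>i</italic>1</sub>, ..., <italic>x</italic><sub><italic>id</italic></sub>] is chosen and the
distances between <italic>x</italic><sub><italic>i</italic></sub> and all the prototype
vectors are calculated to obtain the best-matching unit
(BMU). Units topologically close to the BMU are
then updated, moving their values towards <italic>x</italic><sub><italic>i</italic></sub>.</p>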
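        <p>The training step described above can be sketched as follows. This is a simplified illustration, not the implementation used in this work: it assumes a rectangular grid, the Euclidean distance for the BMU search, a Gaussian neighborhood, and fixed values for the learning rate and neighborhood width; all names are hypothetical.</p>
        <preformat><![CDATA[
```python
import math
import random


def train_step(prototypes, positions, x, learning_rate, sigma):
    """One SOM update: find the BMU for x, then pull nearby units towards x."""
    # Best-matching unit: prototype with the smallest Euclidean distance to x.
    bmu = min(range(len(prototypes)), key=lambda j: math.dist(prototypes[j], x))
    for j, m_j in enumerate(prototypes):
        # Gaussian neighborhood measured on the grid, not in the input space.
        grid_dist = math.dist(positions[j], positions[bmu])
        h = math.exp(-(grid_dist ** 2) / (2 * sigma ** 2))
        for k in range(len(x)):
            m_j[k] += learning_rate * h * (x[k] - m_j[k])
    return bmu


random.seed(0)
rows, cols, dim = 3, 3, 2
positions = [(r, c) for r in range(rows) for c in range(cols)]
prototypes = [[random.random() for _ in range(dim)] for _ in positions]

x = [1.0, 0.0]
before = [math.dist(m, x) for m in prototypes]
bmu = train_step(prototypes, positions, x, learning_rate=0.5, sigma=1.0)
after = [math.dist(m, x) for m in prototypes]
```
]]></preformat>
        <p>After the step every unit is at least as close to the sample as before, and the BMU, whose neighborhood weight is 1, has moved half of the way because the learning rate is 0.5.</p>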
        <p>Distances between the data vectors
and the prototypes of the SOM can be calculated
using the Euclidean, Cosine or other metrics. The
neighborhood considered around the BMU can
be circular, square or hexagonal in
shape, and the distance between a unit and the
BMU can be weighted by a Gaussian or
difference-of-Gaussians function, so that units closest to the BMU
are updated with different weights than
units farther from it. During training, the weights
used for updating the units and the size of the
neighborhood can change according to several
possible rules.</p>
        <p>
          The algorithm has two interesting
characteristics that suggest its use for data visualization:
quantization and projection. Quantization refers
to the creation of a set of prototype vectors that
reproduce the original data set as well as possible,
while projection tries to find low-dimensional
coordinates that preserve the distribution of
the original high-dimensional data. The SOM
algorithm has proved to be especially good at
maintaining the topology of the original dataset, meaning
that if two data samples are close to each other in
the grid, they are likely to be close in the original
high-dimensional data space
          <xref ref-type="bibr" rid="ref12">(Vesanto, 2002)</xref>
          .
        </p>
        <p>
          These features and the possible variations and
parameters of the Self-Organizing Map make it
an interesting tool for exploratory data analysis,
particularly for visualization
          <xref ref-type="bibr" rid="ref12 ref8">(Morais et al., 2014; Vesanto, 2002)</xref>
          . There are three main categories of
SOM applications for data visualization: 1)
methods that give an idea of the overall data shape and
detect possible cluster structures; 2) methods that
analyze the prototype vectors (as representatives
of the whole dataset); and 3) methods for the analysis
of new data samples for classification and novelty
detection purposes.
        </p>
        <p>
          In this paper we use visualization methods
related to the second and third categories: the
U-Matrix and the plotting of existing data samples (in
our case, query prototypes or templates) over
the U-Matrix. The Unified Distance Matrix
(U-Matrix) is one of the most used representations of
the trained SOM
          <xref ref-type="bibr" rid="ref2">(Gorricha and Lobo, 2012)</xref>
          . It is
a visual representation of the SOM that reveals the
cluster structure of the data set. The approach colors
a grid according to the distance between each
prototype vector and its neighbors: dark colors are
chosen to represent large distances, while light
colors correspond to proximity in the input space and
thus represent clusters.
        </p>
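        <p>As a sketch of how such a matrix can be obtained, the code below computes, for each unit, the mean distance between its prototype and those of its grid neighbors. It assumes a rectangular grid with 4-connected neighbors, the Euclidean distance and a map of at least two units; the function name u_matrix and the toy prototypes are hypothetical.</p>
        <preformat><![CDATA[
```python
import math


def u_matrix(grid):
    """Mean distance from each prototype to its 4-connected grid neighbors."""
    rows, cols = len(grid), len(grid[0])
    result = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(math.dist(grid[r][c], grid[nr][nc]))
            result[r][c] = sum(dists) / len(dists)
    return result


# Toy 2x4 map: the two left columns lie near the origin, the two right
# columns lie far away, so the map holds two groups of similar prototypes.
grid = [
    [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]],
    [[0.1, 0.0], [0.1, 0.1], [5.1, 5.0], [5.1, 5.1]],
]
um = u_matrix(grid)
```
]]></preformat>
        <p>In the toy map, the outer columns get small values (light cells inside a cluster), while the two middle columns, which face the other group, get large values (the dark border between clusters).</p>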
      </sec>
      <sec id="sec-3-5">
        <title>Data and Implementation</title>
        <p>After preprocessing, the initial 1.3 million selected
queries were compressed to 8,477 token sets with
2,103 features. As usual in a text mining context,
this dataset is extremely sparse, with only 0.008%
non-zero values.</p>
        <p>Templates were preprocessed in the same
manner as the token sets, using the same IDF
weights and scaling factors. Since some templates
have more than one version, the 45 selected
entries expanded to 51; a suffix letter denotes
a second or third alternative of a template.</p>
        <p>Huang (2008) shows that the Euclidean distance
performs worse than other distances in a text
clustering context. Hence, for this paper, we chose the
Cosine distance as the metric to find BMUs during
the SOM training.</p>
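        <p>The BMU search under the Cosine distance can be sketched as below. The handling of all-zero vectors (distance taken as 1.0) and the function names are assumptions of this sketch, not details of the implementation used.</p>
        <preformat><![CDATA[
```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; taken as 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def best_matching_unit(prototypes, x):
    """Index of the prototype with the smallest cosine distance to x."""
    return min(
        range(len(prototypes)),
        key=lambda j: cosine_distance(prototypes[j], x),
    )


# Hypothetical prototypes and two queries with the same token proportions.
prototypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0]]
short_query = [0.0, 2.0, 2.0]
long_query = [0.0, 20.0, 20.0]
```
]]></preformat>
        <p>Because the Cosine distance depends only on direction, short_query and long_query map to the same unit even though their token counts differ by a factor of ten, which is one reason this kind of metric suits sparse token-count vectors.</p>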
        <p>For this paper, we used a 30x30 SOM trained
for 45 epochs.</p>
      </sec>
      <sec id="sec-3-6">
        <title>Analysis</title>
        <p>We used two plots for an initial visual analysis:
the U-Matrix, presented in Figure 3, in which
numbers indicate the template id over their respective
BMU, and a hit map scatter plot, presented in
Figure 4, in which the size of each circle indicates the
number of token sets that elected that prototype as their
BMU.</p>
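        <p>The counts behind such a hit map can be sketched as follows, assuming the BMU of every token set is already known as a (row, column) grid coordinate. The function name hit_map and the sample coordinates are hypothetical.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def hit_map(bmu_positions):
    """Count how many token sets selected each grid unit as their BMU."""
    hits = Counter()
    for position in bmu_positions:
        hits[position] += 1
    return hits


# BMU grid coordinates (row, column) for five hypothetical token sets.
bmus = [(0, 0), (0, 0), (2, 1), (0, 0), (2, 1)]
hits = hit_map(bmus)
```
]]></preformat>
        <p>Each count would then set the size of the circle drawn at the corresponding grid position; units that no token set elected have a count of zero and draw nothing.</p>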
        <p>From these figures, we can see that the
trained SOM distributes the dataset well
over the prototypes, and that some areas can be visually
identified as clusters (regions of light colors surrounded by
dark points).</p>
        <p>In some cases, more than one template elected
the same prototype as its BMU, as can be checked
in the legend. To see how the trained SOM compares
with direct distances, we calculated a distance
matrix between templates using the Cosine distance
and selected the five closest pairs.</p>
        <p>Below, for each pair, we present the Cosine
distance between the two templates using the Term Frequency
representation and the Euclidean distance between their SOM
BMUs, along with the template names.</p>
        <list list-type="order">
          <list-item>
            <p>Pair: 15 and 15b. Distances: TF: 0.0 and SOM: 0.0.</p>
            <p>15: Splitting 64-bit values into two 32-bit values</p>
            <p>15b: Splitting 64-bit values into two 32-bit values</p>
          </list-item>
          <list-item>
            <p>Pair: 21b and 31. Distances: TF: 0.0 and SOM: 0.0.</p>
            <p>21b: Finding objects by their spectral lines</p>
            <p>31: Using the sppLines table</p>
          </list-item>
          <list-item>
            <p>Pair: 22 and 43. Distances: TF: 0.0205 and SOM: 0.0.</p>
            <p>22: Finding spectra by classification (object type)</p>
            <p>43: QSOs by spectroscopy</p>
          </list-item>
          <list-item>
            <p>Pair: 39 and 39b. Distances: TF: 0.1610 and SOM: 0.0.</p>
            <p>39: Classifications from Galaxy Zoo</p>
            <p>39b: Classifications from Galaxy Zoo</p>
          </list-item>
          <list-item>
            <p>Pair: 05 and 15. Distances: TF: 0.1632 and SOM: 0.0.</p>
            <p>05: Rectangular position search</p>
            <p>15: Splitting 64-bit values into two 32-bit values</p>
          </list-item>
        </list>
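        <p>A comparison of this kind can be sketched as follows: all pairs of templates are ranked by the Cosine distance between their term-frequency vectors, and the Euclidean distance between their BMU grid coordinates is attached to each pair. The template labels, vectors and coordinates below are made up for illustration and are not the values behind the list above.</p>
        <preformat><![CDATA[
```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def closest_pairs(vectors, bmu_positions, top=5):
    """Rank template pairs by cosine distance, with their BMU grid distance."""
    rows = []
    for a, b in combinations(sorted(vectors), 2):
        tf_dist = cosine_distance(vectors[a], vectors[b])
        som_dist = math.dist(bmu_positions[a], bmu_positions[b])
        rows.append((tf_dist, som_dist, a, b))
    return sorted(rows)[:top]


# Hypothetical term-frequency vectors and BMU coordinates for three templates.
vectors = {"A": [1.0, 2.0, 0.0], "Ab": [1.0, 2.0, 0.0], "B": [0.0, 1.0, 3.0]}
bmus = {"A": (4, 7), "Ab": (4, 7), "B": (20, 11)}
pairs = closest_pairs(vectors, bmus)
```
]]></preformat>
        <p>A pair with a small Cosine distance and a BMU distance of zero, such as A and Ab here, is the pattern reported in the list above: templates that are close in the token space end up on the same map unit.</p>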
        <p>The SQL queries of the templates listed here
are presented in Appendix A.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions and Future Work</title>
      <p>As this is a work in progress, further analysis is definitely
due. Still, these very early results with the SOM
justify further work, since close pairs
of queries are being correctly mapped close to one
another.</p>
      <p>The Self-Organizing Map was selected as a
visualization tool due to its quantization and
projection properties. Other methods such as clustering
could be used, but preliminary tests showed that
the selection of algorithms and parameters is not
trivial, and the results were not as useful for
exploratory data analysis as the SOM and its visual
representations.</p>
      <p>Next steps include the evaluation of which
queries were similar (but not equal) to a specific
template, in order to identify queries that were
derived from a template; the analysis of clusters
of queries that do not have an associated
template, which could uncover possible good
candidates for new templates: popular queries that
can be included in the list presented in the
SkyServer as samples; and finally, the processing of
the whole log of queries to build a more
comprehensive dataset of the historical logs.</p>
      <p>This structured representation can also be
correlated with other features in the logs, such as elapsed
time or error results, allowing other applications
of KDD, such as running time or failure
prediction.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>Vitor Hirota Makiyama was supported by a grant
from Coordenação de Aperfeiçoamento de
Pessoal de Nível Superior (CAPES).</p>
      <p>
        The implementation of the SOM algorithm in
this paper was based on the work of Vettigli
        <xref ref-type="bibr" rid="ref13">(2015)</xref>
        , licensed under the Creative Commons
Attribution 3.0 Unported License.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Appendix A. SkyServer SQL Templates</title>
      <p>Sample SQL templates available from SkyServer’s
help pages that are mentioned in this paper. The
list below comprises the identification number
used in the exploratory analysis process, name and
category, a brief explanation, and the SQL
statement.</p>
      <p>05: Rectangular position search (Basic SQL)</p>
      <p>Rectangular search using straight coordinate
constraints.</p>
      <preformat>select objid, ra, dec
from photoobj
where (ra between 179.5 and 182.3)
  and (dec between 1.0 and 1.8)</preformat>
      <p>15: Splitting 64-bit values into two 32-bit values
(SQL Jujitsu)</p>
      <p>The flag fields in the SpecObjAll table are
64-bit, but some analysis tools only accept 32-bit
integers. Here is a way to split them up using
bitmasks to extract the higher and lower 32
bits and dividing by a power of 2 to shift bits
to the right (since there is no bit shift operator
in SQL).</p>
      <preformat>select top 10 objid, ra, dec,
  flags, -- output the whole bigint as a check
  flags &amp; 0x00000000ffffffff as flags_lo, -- get the lower 32 bits with a mask
  -- shift the bigint to the right 32 bits, then use the same mask to
  -- get upper 32 bits
  (flags / power(cast(2 as bigint), 32)) &amp; 0x00000000ffffffff as flags_hi
from photoobj</preformat>
      <p>15b: Splitting 64-bit values into two 32-bit values
(SQL Jujitsu)</p>
      <p>The hexadecimal version of the above query,
which can be used for debugging.</p>
      <preformat>select top 10 objid, ra, dec,
  cast(flags as binary(8)) as flags,
  cast(flags &amp; 0x00000000ffffffff as binary(8)) as flags_lo,
  cast((flags / power(cast(2 as bigint), 32)) &amp; 0x00000000ffffffff as binary(8)) as flags_hi
from photoobj</preformat>
      <p>21b: Finding objects by their spectral lines
(General Astronomy)</p>
      <p>This query selects red stars (spectral type K)
with large CaII triplet eq widths with low
errors on the CaII triplet equivalent widths.</p>
      <preformat>select sl.plate, sl.mjd, sl.fiber,
  sl.caiikside, sl.caiikerr, sl.caiikmask,
  sp.fehadop, sp.fehadopunc, sp.fehadopn,
  sp.loggadopn, sp.loggadopunc, sp.loggadopn
from spplines as sl
  join sppparams as sp on sl.specobjid = sp.specobjid
where fehadop &lt; -3.5
  and fehadopunc between 0.01 and 0.5
  and fehadopn &gt; 3</preformat>
      <p>22: Finding spectra by classification (object type)
(General Astronomy)</p>
      <p>This sample query finds all objects with
spectra classified as stars.</p>
      <preformat>select top 100 specobjid
from specobj
where class = 'star'
  and zwarning = 0</preformat>
      <p>31: Using the sppLines table (Stars)</p>
      <p>This sample query selects low metallicity
stars ([Fe/H] &lt; -3.5) where more than three
different measures of feh are ok and are
averaged.</p>
      <preformat>select sl.plate, sl.mjd, sl.fiber,
  sl.caiikside, sl.caiikerr, sl.caiikmask,
  sp.fehadop, sp.fehadopunc, sp.fehadopn,
  sp.loggadopn, sp.loggadopunc, sp.loggadopn
from spplines as sl
  join sppparams as sp on sl.specobjid = sp.specobjid
where fehadop &lt; -3.5
  and fehadopunc between 0.01 and 0.5
  and fehadopn &gt; 3</preformat>
      <p>39: Classifications from Galaxy Zoo (Galaxies)</p>
      <p>Find the weighted probability that a given
galaxy has each of the six morphological
classifications.</p>
      <preformat>select objid, nvote,
  p_el as elliptical,
  p_cw as spiralclock,
  p_acw as spiralanticlock,
  p_edge as edgeon,
  p_dk as dontknow,
  p_mg as merger
from zoonospec
where objid = 1237656495650570395</preformat>
      <p>39b: Classifications from Galaxy Zoo (Galaxies)</p>
      <p>Find 100 galaxies that have clean photometry,
at least 10 Galaxy Zoo volunteer votes and at
least an 80% probability of being clockwise
spirals.</p>
      <preformat>select top 100 g.objid, zns.nvote,
  zns.p_el as elliptical,
  zns.p_cw as spiralclock,
  zns.p_acw as spiralanticlock,
  zns.p_edge as edgeon,
  zns.p_dk as dontknow,
  zns.p_mg as merger
from galaxy as g
  join zoonospec as zns on g.objid = zns.objid
where g.clean = 1
  and zns.nvote &gt;= 10
  and zns.p_cw &gt; 0.8</preformat>
      <p>43: QSOs by spectroscopy (Quasars)</p>
      <p>The easiest way to find quasars is by
finding objects whose spectra have been
classified as quasars. This sample query searches
the SpecObj table for the IDs and redshifts of
objects with the class column equal to 'QSO'.</p>
      <preformat>select top 100 specobjid, z
from specobj
where class = 'qso'
  and zwarning = 0</preformat>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Usama M.</given-names>
            <surname>Fayyad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Gregory</given-names>
            <surname>Piatetsky-Shapiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Padhraic</given-names>
            <surname>Smyth</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ramasamy</given-names>
            <surname>Uthurusamy</surname>
          </string-name>
          .
          <year>1996</year>
          .
          <article-title>Advances in Knowledge Discovery and Data Mining</article-title>
          . AAAI Press / The MIT Press.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Jorge</given-names>
            <surname>Gorricha</surname>
          </string-name>
          and
          <string-name>
            <given-names>Victor</given-names>
            <surname>Lobo</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Improvements on the visualization of clusters in geo-referenced data using Self-Organizing Maps</article-title>
          .
          <source>Computers &amp; Geosciences</source>
          ,
          <volume>43</volume>
          :
          <fpage>177</fpage>
          -
          <lpage>186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Anna</given-names>
            <surname>Huang</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Similarity Measures for Text Document Clustering</article-title>
          . In New Zealand Computer Science Research Student Conference, pages
          <fpage>49</fpage>
          -
          <lpage>56</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Anil K.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Narasimha</given-names>
            <surname>Murty</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P. Joseph</given-names>
            <surname>Flynn</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>Data clustering: a review</article-title>
          .
          <source>ACM computing surveys (CSUR)</source>
          ,
          <volume>31</volume>
          (
          <issue>3</issue>
          ):
          <fpage>264</fpage>
          -
          <lpage>323</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Teuvo</given-names>
            <surname>Kohonen</surname>
          </string-name>
          .
          <year>1998</year>
          .
          <article-title>The self-organizing map</article-title>
          .
          <source>Neurocomputing</source>
          ,
          <volume>21</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Teuvo</given-names>
            <surname>Kohonen</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>Self-organizing maps</article-title>
          , volume
          <volume>30</volume>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Bjornar</given-names>
            <surname>Larsen</surname>
          </string-name>
          and
          <string-name>
            <given-names>Chinatsu</given-names>
            <surname>Aone</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>Fast and Effective Text Mining Using Linear-Time Document Clustering</article-title>
          .
          <source>In Proceedings of the 5th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</source>
          , volume
          <volume>5</volume>
          , pages
          <fpage>16</fpage>
          -
          <lpage>22</lpage>
          . ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Alessandra M. M.</given-names>
            <surname>Morais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Marcos G.</given-names>
            <surname>Quiles</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Rafael D. C.</given-names>
            <surname>Santos</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Icon and Geometric Data Visualization with a Self-Organizing Map Grid</article-title>
          . In
          <source>Computational Science and Its Applications - ICCSA 2014</source>
          , volume
          <volume>8584</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>562</fpage>
          -
          <lpage>575</lpage>
          . Springer International Publishing.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
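          <string-name>
            <given-names>M. Jordan</given-names>
            <surname>Raddick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ani R.</given-names>
            <surname>Thakar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Alexander S.</given-names>
            <surname>Szalay</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Rafael D. C.</given-names>
            <surname>Santos</surname>
          </string-name>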
          .
          <year>2014</year>
          .
          <article-title>Ten Years of SkyServer I: Tracking Web and SQL e-Science Usage</article-title>
          .
          <source>Computing in Science &amp; Engineering</source>
          ,
          <volume>16</volume>
          (
          <issue>4</issue>
          ):
          <fpage>22</fpage>
          -
          <lpage>31</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>G.</given-names>
            <surname>Salton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <year>1975</year>
          .
          <article-title>A vector space model for automatic indexing</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>18</volume>
          (
          <issue>11</issue>
          ):
          <fpage>613</fpage>
          -
          <lpage>620</lpage>
          , November.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Ah-Hwee</given-names>
            <surname>Tan</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>Text Mining: The state of the art and the challenges</article-title>
          .
          <source>Proceedings of the PAKDD 1999 Workshop on Knowledge Discovery from Advanced Databases</source>
          ,
          <volume>8</volume>
          :
          <fpage>65</fpage>
          -
          <lpage>70</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Juha</given-names>
            <surname>Vesanto</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Data Exploration Process Based on the Self-Organizing Map</article-title>
          .
          <source>Ph.D. thesis</source>
          , Helsinki University of Technology.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>G.</given-names>
            <surname>Vettigli</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>MiniSom: minimalistic and NumPy based implementation of the Self Organizing Maps</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
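          <string-name>
            <given-names>Jian</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Chaomei</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael S.</given-names>
            <surname>Vogeley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Danny</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ani R.</given-names>
            <surname>Thakar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. Jordan</given-names>
            <surname>Raddick</surname>
          </string-name>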
          .
          <year>2012</year>
          .
          <article-title>SDSS Log Viewer: visual exploratory analysis of large-volume SQL log data</article-title>
          .
          <volume>8294</volume>
          :
          <fpage>82940D</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>