<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Time Complexity and Parallel Speedup to Compute the Gamma Summarization Matrix</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Carlos Ordonez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yiqun Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Houston</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We study the serial and parallel computation of Γ (Gamma), a comprehensive data summarization matrix for linear Gaussian models, widely used in big data analytics. Computing Gamma can be reduced to a single matrix multiplication with the data set, where such multiplication can be evaluated as a sum of vector outer products, enabling incremental and parallel computation, essential features for scalable computation. By exploiting Gamma, iterative algorithms are reorganized to work in two phases: (1) incremental-parallel data set summarization (i.e. in one scan and distributive); (2) iteration in main memory exploiting the summarization matrix in intermediate matrix computations (i.e. reducing the number of scans). Most intermediate computations on large matrices collapse to computations based on Gamma, a much smaller matrix. We present specialized database algorithms for dense and sparse matrices, respectively. Assuming a distributed memory model (i.e. shared-nothing) and a larger number of points than processing nodes, we show the parallel computation of Gamma has close to linear speedup. We explain how to compute Gamma with existing database systems' processing mechanisms and their impact on time complexity.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Machine learning models generally require demanding iterative computations on
matrices. This problem becomes significantly more difficult when input matrices
(i.e. the data set) are large, residing on secondary storage, and when they need
to be processed in parallel. We propose the Γ (Gamma) data summarization
matrix [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] for linear Gaussian models [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. From a computational perspective, Γ
has outstanding features: it produces an exact solution, it can be incrementally
computed in one pass, and it can be computed in parallel with low
communication overhead in a distributed memory model (i.e. shared-nothing [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]). We show
computing Γ can be reduced to a single matrix self-multiplication, and we study
the time and space complexity and parallel speedup of such multiplication.
The input data set is a d × n matrix X, with n points x1, . . . , xn (one per column), viewed
as a wide rectangular matrix. Supervised (predictive) models require an extra
attribute. For regression models X is augmented with a (d + 1)th dimension
containing an output variable Y. On the other hand, for classification models, there
is an extra discrete attribute G, where G is in general binary (e.g. false/true,
bad/good). We use i = 1 . . . n and j = 1 . . . d as matrix subscripts. The model, a
“bag” of vectors and matrices, is called Θ. Θ can represent principal
component analysis (PCA) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], linear regression [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], K-means clustering [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and
Naive Bayes classification [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], among others.
      </p>
    </sec>
    <sec id="sec-2">
      <title>The Gamma Data Summarization Matrix</title>
      <p>
        We introduce a general augmented matrix Z, created by prefixing X with a row of 1s
and appending a (d + 2)th row containing vector Y (in transposed form).
Since X is d × n, Z is (d + 2) × n, where row 1 contains 1s and row (d + 2) is Y. For
classification and clustering models X is partitioned using a discrete attribute
G (e.g. class 0/1, cluster 1 . . . k). In this paper, we focus on Z extended with Y.
      </p>
      <p>
        Multidimensional sufficient statistics [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] summarize statistical properties of
a data set: n = |X|, L = Σ_{i=1}^n x_i, Q = XX^T. Intuitively, n counts points, L is a
simple “linear” sum (a vector) and Q is a “quadratic” matrix. We now introduce
Γ, our main contribution, which generalizes previous sufficient statistics:
      </p>
      <p>
        Γ = | n      L^T      1^T·Y^T |     | n        Σ x_i^T        Σ y_i     |
            | L      Q        X·Y^T   |  =  | Σ x_i    Σ x_i·x_i^T    Σ x_i·y_i |
            | Y·1    Y·X^T    Y·Y^T   |     | Σ y_i    Σ y_i·x_i^T    Σ y_i^2   |
        (all sums taken over i = 1 . . . n)
      </p>
      <p>
        The fundamental property of Γ is that it can be computed by a single matrix
multiplication with Z: Γ = ZZ^T. Notice ZZ^T ≠ Z^T·Z, a property of Gramian
matrices [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Matrix Γ is fundamentally the result of “squaring” matrix Z. Γ is
much smaller than X for large n: O(d^2) ≪ O(dn). Moreover, Γ is symmetric and computable
via vector outer products. The property below gives two equivalent forms to
summarize the data set: one based on matrix-matrix multiplication and a second
one based on vector-vector multiplication, which is more efficient for incremental
and parallel processing.
      </p>
      <p>Proposition: Γ can be equivalently computed as follows:
Γ = ZZ^T = Σ_{i=1}^n z_i · z_i^T .    (1)</p>
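      <p>
        As an illustration of the proposition, a minimal numpy sketch (the data values and sizes here are arbitrary, not from the paper) confirming that the single matrix self-multiplication ZZ^T equals the sum of the n vector outer products z_i · z_i^T:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 10                       # d dimensions, n points (d much smaller than n)
X = rng.standard_normal((d, n))    # data set: one point per column
Y = rng.standard_normal((1, n))    # output variable for regression

# Augmented matrix Z: a row of 1s, then X, then Y, giving (d+2) x n
Z = np.vstack([np.ones((1, n)), X, Y])

# Form 1: single matrix self-multiplication
Gamma = Z @ Z.T

# Form 2: sum of n vector outer products (incremental/parallel friendly)
Gamma_outer = np.zeros((d + 2, d + 2))
for i in range(n):
    z_i = Z[:, i:i+1]              # column vector z_i
    Gamma_outer += z_i @ z_i.T

assert np.allclose(Gamma, Gamma_outer)
print(Gamma.shape)                 # (d+2) x (d+2), independent of n
```
      </p>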
    </sec>
    <sec id="sec-3">
      <title>Serial Computation</title>
      <p>Recall from Section 2 that Θ represents a model. We start by presenting the
standard iterative algorithm to compute a model Θ in two phases. Phase 1:
initialize Θ; Phase 2: iterate until convergence (i.e. until Θ does not change),
reading X from secondary storage at every iteration (once or several times). Our
proposed equivalent optimized iterative 2-phase algorithm based on Γ follows.
Phase 1: compute Γ, updating Γ in main memory while reading X from secondary
storage, and initialize Θ from Γ; Phase 2: iterate until convergence exploiting Γ in
main memory in intermediate matrix computations.</p>
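      <p>
        To make the optimized 2-phase algorithm concrete, a sketch (our own illustration, choosing least-squares linear regression as the model Θ and synthetic data): Phase 1 summarizes X in one pass; Phase 2 solves for the coefficients using only submatrices of Γ, never rereading X:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 1000
X = rng.standard_normal((d, n))
beta_true = np.array([2.0, -1.0, 0.5])
Y = beta_true @ X + 0.01 * rng.standard_normal(n)

# Phase 1: one pass over the data, accumulating Gamma incrementally.
Gamma = np.zeros((d + 2, d + 2))
for i in range(n):
    z_i = np.concatenate(([1.0], X[:, i], [Y[i]]))[:, None]
    Gamma += z_i @ z_i.T

# Phase 2: work only on Gamma, whose size is independent of n.
# Q = X X^T and X Y^T appear as blocks of Gamma.
Q = Gamma[1:d+1, 1:d+1]        # sum of x_i x_i^T
XYt = Gamma[1:d+1, d+1]        # sum of x_i y_i
beta = np.linalg.solve(Q, XYt) # least-squares fit without rereading X
```
      </p>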
      <p>Lemma (Succinctness: asymptotically small size): Given a large data set X
where d ≪ n, the size of matrix Γ is O(d^2) ≪ O(dn) as n → ∞.</p>
      <p>Our time and space analysis is based on the 2-phase algorithm introduced
above. Phase 1, which corresponds to the summarization matrix operator, is the
most important. Computing Γ with a dense matrix multiplication is O(d^2·n).
On the other hand, computing Γ with a sparse matrix multiplication is O(k^2·n)
in the average case, assuming k entries of x_i are non-zero; if X is
hyper-sparse, so that k^2 = O(d), then the time is O(dn) on average. In Phase 2 we take
advantage of Γ to accelerate computations involving X. Since we compute
matrix factorizations derived from Γ, time is Ω(d^3), and for a dense matrix
it may approach O(d^4) when the number of iterations of the factorization
numerical method is proportional to d. In short, the time complexity of Phase 2 for
the models we consider does not depend on n. I/O cost is just reading X, with
the corresponding number of I/Os taking time O(dn) for dense matrix storage and
O(kn) for sparse matrix storage.</p>
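      <p>
        The O(k^2·n) sparse case can be sketched as follows (our own illustration, with hypothetical data): each point z_i is stored as its k non-zero (subscript, value) pairs, and only k·k products are accumulated per point, so zero entries cost nothing:

```python
import numpy as np

d, n = 6, 4
# Sparse points: a list of (j, v) pairs per point, with j in 0..d+1 indexing
# the rows of z_i (row 0 is the constant 1, rows 1..d hold x_i, row d+1 holds y_i).
points = [
    [(0, 1.0), (2, 3.0), (7, 1.5)],
    [(0, 1.0), (1, 2.0), (4, -1.0), (7, 0.5)],
    [(0, 1.0), (5, 4.0), (7, -2.0)],
    [(0, 1.0), (3, 1.0), (7, 1.0)],
]

Gamma = np.zeros((d + 2, d + 2))
for z in points:
    for (j1, v1) in z:          # k non-zero entries
        for (j2, v2) in z:      # hence k*k products per point
            Gamma[j1, j2] += v1 * v2

print(Gamma[0, 0])  # equals n, since every z_i carries the constant 1
```
      </p>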
    </sec>
    <sec id="sec-4">
      <title>Parallel Computation</title>
      <p>
        Let N be the number of processing nodes, under a distributed memory
architecture (shared-nothing [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]). That is, each node has its own main memory and
secondary storage. We assume d ≪ n and N ≪ n, but d is independent from
N: either d ≪ N or N ≪ d may hold. Our parallel algorithm follows. Phase 1 with N nodes:
compute Γ in parallel, updating Γ in main memory while reading X from secondary
storage. Phase 2 with 1 node: iterate until convergence exploiting Γ in main
memory in intermediate matrix computations to compute Θ.
      </p>
      <p>Recall we assume d &lt; n and n → ∞. For Phase 1 each x_i ∈ X is hashed to one of the
N processors. Then our matrix operator can compute each outer product z_i · z_i^T
in parallel, resulting in O(d^2·n/N) work per processor for a dense matrix and
O(dn/N) work per processor for a sparse matrix. That is, both dense and sparse
matrix multiplication algorithms become optimal as N increases or n → ∞.
Given its importance from a time complexity point of view, we present the
following result as a theorem:</p>
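      <p>
        A minimal sketch of parallel Phase 1 (our own illustration; the partitions stand in for the N shared-nothing nodes, here simulated in one process with a simple modulo partition in place of hashing):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, N = 3, 12, 4                 # N processing nodes, n points
Z = np.vstack([np.ones((1, n)),
               rng.standard_normal((d, n)),     # X
               rng.standard_normal((1, n))])    # Y

# Each point is assigned to one of N nodes; here a simple modulo partition.
partitions = [Z[:, np.arange(n) % N == p] for p in range(N)]

# Phase 1 (per node): each node summarizes only its local points.
partial = [Zp @ Zp.T for Zp in partitions]   # O(d^2 n / N) work per node

# Coordinator: the global Gamma is just the sum of the N partial matrices,
# so only N small (d+2)x(d+2) messages need to be communicated.
Gamma = sum(partial)

assert np.allclose(Gamma, Z @ Z.T)
```
      </p>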
      <p>Theorem (linear speedup): Let T_j be the processing time using j nodes, where
1 ≤ j ≤ N. Under our main assumptions, and provided Θ fits in main memory, our
optimized algorithm gets close to optimal speedup: T_1/T_N ≈ O(N).</p>
      <p>For completeness, we provide an analysis of a parallel system where N is
large. That is, N = O(n).</p>
      <p>Lemma (sequential communication bottleneck): Assume only one node can
receive partial summarizations. Then time complexity with N messages to a single
coordinator is O(d^2·n/N + N·d^2).</p>
      <p>Lemma (improved communication: global summarization with a balanced binary
tree): Assume partial results Γ[I] can be sent to other workers in a binary tree
fashion. Then T(d, n, N) ≥ O(d^2·n/N + log_2(N)·d^2) (a lower communication bound).</p>
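      <p>
        The binary-tree aggregation can be sketched as follows (our own illustration, with N a power of 2): partial matrices Γ[I] are merged pairwise, so the communication term drops from N messages to log_2(N) rounds:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, N = 3, 16, 8
Z = np.vstack([np.ones((1, n)), rng.standard_normal((d + 1, n))])

# Partial summarization Gamma[I] on each of the N workers.
parts = [Zp @ Zp.T for Zp in np.array_split(Z, N, axis=1)]

# Balanced binary tree: pairwise merges in log2(N) rounds instead of
# N sequential messages to a single coordinator.
rounds = 0
while len(parts) > 1:
    parts = [parts[k] + parts[k + 1] for k in range(0, len(parts), 2)]
    rounds += 1

Gamma = parts[0]
assert rounds == 3                    # log2(8) communication rounds
assert np.allclose(Gamma, Z @ Z.T)
```
      </p>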
    </sec>
    <sec id="sec-5">
      <title>Computing Γ in a Database System</title>
      <p>Here we connect our matrix-based parallel computation with a database system
in two ways: (1) with relational queries; (2) by extending the system with matrix functions. We assume
X can be evenly partitioned by i, a reasonable assumption when d ≪ n.</p>
      <p>We now treat X as a table in order to process it with relational algebra, the
formalization of SQL. In order to simplify the presentation we use X as the table
name and we treat Y as another dimension. There are two major alternatives:
(1) defining a table with d columns plus primary key i, X(i, X1, X2, . . . , Xd),
ideal for dense matrices; (2) defining a table with triples, using the matrix
subscripts as primary key: X(i, j, v), where i is the point id, j the dimension
subscript and v the matrix entry value, excluding zeroes (entries where v = 0), i.e. a
sparse matrix. Alternative (1) requires expressing the matrix product as d^2
aggregations, which requires a wrapper program to generate the query. Notice
Xj is a column name: we cannot assume there is subscript-based access.
Alternative (2) can express the computation as a single query, making the
solution expressible within relational algebra, and it is natural for sparse matrices.
In the following query X1 and X2 are aliases of X because it is referenced
twice: Γ = π_{X1.j, X2.j, sum(X1.v ∗ X2.v)}(X1 ⋈_{X1.i = X2.i} X2). The time complexity
to evaluate the Gamma query depends on the join algorithm. Therefore, it can
range from O(d^2·n) (hash join, balanced buckets), to O(d^2·n·log(n)) (sort-merge join, merge
join), to O(d^2·n^2) (nested loop join). These bounds can be tighter if the join
operation can be avoided using a subscript-based mechanism in main memory,
when Γ fits in main memory.</p>
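      <p>
        A concrete version of alternative (2) can be sketched with SQLite (our own illustration; the data values are hypothetical, and the table follows the X(i, j, v) schema above with j indexing the rows of Z):

```python
import sqlite3
import numpy as np

# Dense equivalent for checking: rows of Z are 1s, X1, X2, Y (d = 2, n = 3).
Z = np.array([[1.0, 1.0, 1.0],
              [2.0, 0.0, 1.0],
              [0.0, 3.0, 1.0],
              [5.0, 4.0, 6.0]])

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE X (i INTEGER, j INTEGER, v REAL, PRIMARY KEY (i, j))")
# Triple (sparse) storage: insert only non-zero entries.
for j in range(Z.shape[0]):
    for i in range(Z.shape[1]):
        if Z[j, i] != 0.0:
            con.execute("INSERT INTO X VALUES (?, ?, ?)", (i, j, Z[j, i]))

# The Gamma query: a self-join on the point id i, aggregated per (j1, j2).
rows = con.execute("""
    SELECT X1.j, X2.j, SUM(X1.v * X2.v)
    FROM X AS X1 JOIN X AS X2 ON X1.i = X2.i
    GROUP BY X1.j, X2.j
""").fetchall()

Gamma = np.zeros((4, 4))
for j1, j2, s in rows:
    Gamma[j1, j2] = s

assert np.allclose(Gamma, Z @ Z.T)
```
      </p>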
      <p>
        The previous query solution requires an expensive join (⋈), which can be
slow. Therefore, our goal is to eliminate the join operation by enabling
efficient subscript-based access. Our proposal is to introduce a matrix function
in an extended database model, where tuples are transformed into vectors and
a matrix can be the result of the query. This function bridges relational
tables on one side with matrices on the other side. Let Gamma() be an
aggregation which returns a (d + 2) × (d + 2) matrix. There are two elegant solutions:
(1) Γ = Gamma(X1, X2, . . . , Xd)(X) for X with a “horizontal” schema; (2)
Γ = Gamma(X) for a “vertical” schema, where the function assumes X has
the schema defined above. In this case time complexity ranges from O(d^2·n) to
O(d^2·n·log(n)), depending on the aggregation algorithm. The price we pay
is that the query is no longer expressible within relational algebra, and therefore
it is not feasible to treat Γ as a new relational operator. This second mechanism
corresponds to so-called SQL UDFs [
        <xref ref-type="bibr" rid="ref4 ref9">4, 9</xref>
        ] or user-defined array operators [
        <xref ref-type="bibr" rid="ref10 ref7">7, 10</xref>
        ].
      </p>
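      <p>
        A minimal sketch of such a Gamma() aggregation for the “horizontal” schema, using SQLite's user-defined aggregate mechanism (our own illustration with hypothetical data; a production UDF in a parallel DBMS would differ, and since SQLite aggregates return scalars the matrix is serialized as JSON):

```python
import sqlite3
import json
import numpy as np

class GammaAgg:
    """Aggregate accumulating Gamma as the sum of z_i z_i^T over the rows."""
    def __init__(self):
        self.G = None

    def step(self, *values):           # one row: (X1, ..., Xd, Y)
        z = np.concatenate(([1.0], np.asarray(values, dtype=float)))
        if self.G is None:
            self.G = np.zeros((len(z), len(z)))
        self.G += np.outer(z, z)       # subscript-based, no join needed

    def finalize(self):                # serialize the (d+2)x(d+2) matrix
        return json.dumps(self.G.tolist())

con = sqlite3.connect(":memory:")
con.create_aggregate("gamma", 3, GammaAgg)   # d = 2 columns plus Y
con.execute("CREATE TABLE X (i INTEGER PRIMARY KEY, X1 REAL, X2 REAL, Y REAL)")
con.executemany("INSERT INTO X VALUES (?, ?, ?, ?)",
                [(1, 2.0, 0.0, 5.0), (2, 0.0, 3.0, 4.0), (3, 1.0, 1.0, 6.0)])

Gamma = np.array(json.loads(
    con.execute("SELECT gamma(X1, X2, Y) FROM X").fetchone()[0]))

# Check against the dense self-multiplication on the same data.
Z = np.array([[1.0, 1.0, 1.0],
              [2.0, 0.0, 1.0],
              [0.0, 3.0, 1.0],
              [5.0, 4.0, 6.0]])
assert np.allclose(Gamma, Z @ Z.T)
```
      </p>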
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>D.</given-names>
            <surname>DeWitt</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Gray</surname>
          </string-name>
          .
          <article-title>Parallel database systems: the future of high performance database systems</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>35</volume>
          (
          <issue>6</issue>
          ):
          <fpage>85</fpage>
          -
          <lpage>98</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>G.H.</given-names>
            <surname>Golub</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.F.</given-names>
            <surname>Van Loan</surname>
          </string-name>
          .
          <source>Matrix computations</source>
          . Johns Hopkins University, 3rd edition,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>C.</given-names>
            <surname>Ordonez</surname>
          </string-name>
          .
          <article-title>Integrating K-means clustering with a relational DBMS using SQL</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering (TKDE)</source>
          ,
          <volume>18</volume>
          (
          <issue>2</issue>
          ):
          <fpage>188</fpage>
          -
          <lpage>201</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>C.</given-names>
            <surname>Ordonez</surname>
          </string-name>
          .
          <article-title>Statistical model computation with UDFs</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering (TKDE)</source>
          ,
          <volume>22</volume>
          (
          <issue>12</issue>
          ):
          <fpage>1752</fpage>
          -
          <lpage>1765</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>C.</given-names>
            <surname>Ordonez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Garcia-Alvarado</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Baladandayuthapani</surname>
          </string-name>
          .
          <article-title>Bayesian variable selection in linear regression in one pass for large datasets</article-title>
          .
          <source>ACM TKDD</source>
          ,
          <volume>9</volume>
          (
          <issue>1</issue>
          ):
          <fpage>3</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>C.</given-names>
            <surname>Ordonez</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Pitchaimalai</surname>
          </string-name>
          .
          <article-title>Bayesian classifiers programmed in SQL</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering (TKDE)</source>
          ,
          <volume>22</volume>
          (
          <issue>1</issue>
          ):
          <fpage>139</fpage>
          -
          <lpage>144</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>C.</given-names>
            <surname>Ordonez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Cabrera</surname>
          </string-name>
          .
          <article-title>The Gamma matrix to summarize dense and sparse data sets for big data analytics</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering (TKDE)</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>S.</given-names>
            <surname>Roweis</surname>
          </string-name>
          and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ghahramani</surname>
          </string-name>
          .
          <article-title>A unifying review of linear Gaussian models</article-title>
          .
          <source>Neural Computation</source>
          ,
          <volume>11</volume>
          :
          <fpage>305</fpage>
          -
          <lpage>345</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>M.</given-names>
            <surname>Stonebraker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Abadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.J.</given-names>
            <surname>DeWitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Paulson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pavlo</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Rasin</surname>
          </string-name>
          .
          <article-title>MapReduce and parallel DBMSs: friends or foes?</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>53</volume>
          (
          <issue>1</issue>
          ):
          <fpage>64</fpage>
          -
          <lpage>71</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>M.</given-names>
            <surname>Stonebraker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Becla</surname>
          </string-name>
          .
          <article-title>SciDB: A Database Management System for Applications with Complex Analytics</article-title>
          .
          <source>Computing in Science and Engineering</source>
          ,
          <volume>15</volume>
          (
          <issue>3</issue>
          ):
          <fpage>54</fpage>
          -
          <lpage>62</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>