<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Analytical Data Management for Numerical Simulations</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Ramon G. Costa, Fabio Porto</institution>
          ,
          <addr-line>Bruno Schulze</addr-line>
        </aff>
      </contrib-group>
      <fpage>210</fpage>
      <lpage>214</lpage>
      <abstract>
        <p>Numerical simulation of natural phenomena is being fostered by recent advances in high-performance computing platforms. Scientists in various areas, such as the study of the human cardiovascular system, model the phenomenon under study through a set of mathematical equations. As scientists strive to obtain more realistic simulations, a huge amount of data is produced. Unfortunately, there has been little work on supporting numerical simulation data management, which leaves simulation scientists with huge standard text files and complex analytical programs that eventually extract some meaningful information to validate scientific hypotheses. Moreover, some analytical queries cannot be expressed in any of the existing scientific query languages. This paper tries to bridge this gap by raising some of the issues involved in numerical simulation data analysis. A representation for numerical simulation data is presented that considers a multidimensional model for dimensional variables and their corresponding physical quantities. A cloud service to interface with the numerical simulation data manager is proposed, and its integration with the Neblina cloud middleware is explored.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Many scientific areas are taking advantage of developments in high-performance
computing to model natural phenomena through in-silico simulation. The
modeling process starts by observing phenomenon data and describing
the phenomenon through a set of differential equations
that express the variation of selected physical quantities in space and time. Next,
the scientist chooses an appropriate numerical method to solve the
equations and compute, for each reference point, the values of the selected physical
quantities. Using state-of-the-art cluster platforms, computer scientists strive to
obtain the most realistic simulations possible, for instance to compute weather
forecasts.</p>
      <p>
        In this paper we propose a novel strategy for scientific simulation data
management based on a database approach. We focus on supporting the analysis
of simulation results by taking a sample of a simulation output and loading it into
a cloud data service that uses a multidimensional array representation of
space and time to manage multi-scale, space-time dependent data. We illustrate
our discussion with the simulation of the human cardiovascular system developed
at LNCC, INCT-MACC [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Literature review</title>
      <p>
        Recent studies have demonstrated the need for more efficient storage, indexing,
and processing strategies for scientific data [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        In Ogasawara [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], scientific workflows are modeled as data-intensive
applications. Parameter sweep experiments evaluate data represented as a set of
parameter configuration values. Moreover, typical workflow activities are identified
according to their data consumption and production rates and mapped to
algebraic operators such as Map, Reduce, Filter, and JoinQuery.
      </p>
      <p>
        The storage of numerical simulation data has been investigated by [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
Multidimensional scientific data are modeled using an array data model. The authors
propose different types of array storage models according to array sparsity.
      </p>
      <p>
        Another important initiative in support of scientific data management is
SciDB [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], a database management system for scientific applications. It offers
a multidimensional model based on a multi-array representation. Its functionality
includes data versioning, uniform distribution of data across the nodes of a
cluster, and two query language interfaces, AFL and AQL.
      </p>
      <p>
        Considering the analysis of simulation data, one question is to determine the
past and future cone of information, as presented by Sowa [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Similar is the
study of causality in databases [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], in which data that contribute to a given
result are considered to cause that result, with a certain degree of responsibility.
      </p>
      <p>
        The Magellan project [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] explores the adoption of cloud computing in
scientific applications. It evaluates recent technologies in support of HPC,
such as virtualization, MapReduce, Eucalyptus, and Hadoop. In addition, the
project compares the performance
of scientific applications running on local clusters with their performance in a cloud environment.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Challenges</title>
      <p>As discussed in Section 1, we are interested in specifying a data management
service to support the analysis of numerical simulation results. This section
investigates possible representations for numerical simulation data and how to
extend database functionality to support scientific data handling under them.</p>
      <sec id="sec-3-1">
        <title>Data representation</title>
        <p>Simulation data can be interpreted as composed of two sets of variables:
dimensional variables and physical quantities. Typically, the dimensional variables include space
and time. The space dimension refers to a mesh, which represents the topology
of the physical domain as a composition of simple geometric objects. A mesh is
represented by a set of points, referring to the vertices of the geometric objects,
the set of edges linking the points, and the faces of the model. Note also
that simulations may adopt different scales throughout the domain.
Furthermore, the value of a physical quantity at a reference coordinate may change
across scales. A given simulation may be composed of data at different scales,
according to the precision requirements in different parts of the physical domain.</p>
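        <p>As an illustration only, the representation described above can be sketched in Python. The names Mesh, SimulationData, record and lookup, as well as the sample values, are assumptions introduced for this sketch; they are not part of the simulation code discussed in this paper.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Mesh:
    """Topology of the physical domain (the space dimension)."""
    points: tuple   # vertex coordinates, e.g. ((x, y, z), ...)
    edges: tuple    # pairs of point indices
    faces: tuple    # tuples of point indices


@dataclass
class SimulationData:
    """Physical quantities indexed by (scale, time step, point index)."""
    meshes: dict                                  # scale name -> Mesh
    values: dict = field(default_factory=dict)

    def record(self, scale, t, point, **quantities):
        # Reject points that do not exist in the mesh of the given scale.
        if point not in range(len(self.meshes[scale].points)):
            raise IndexError("point is not part of the mesh for this scale")
        self.values[(scale, t, point)] = dict(quantities)

    def lookup(self, scale, t, point):
        return self.values[(scale, t, point)]


# A single tetrahedron standing in for a 3D mesh (illustrative data).
tetra = Mesh(
    points=((0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)),
    edges=((0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)),
    faces=((0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)),
)
data = SimulationData(meshes={"3D": tetra})
data.record("3D", t=0, point=2, pression=101.3, velocity=(0.1, 0.0, 0.2))
```
]]></preformat>
        <p>Keying the values by scale as well as by time step and point keeps data at different scales side by side, which is the property the multi-scale discussion above relies on.</p>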
      </sec>
      <sec id="sec-3-2">
        <title>Using SciDB for storing simulation data</title>
        <p>
          Our first effort to represent numerical simulation data uses SciDB. A user
specifies multidimensional structures by providing the range of values for each dimension
and a list of attribute values that compose a cell. In addition, a versioning
mechanism keeps historical values for each attribute. In this context, the following
mapping strategy has been defined for each set of physical quantities
corresponding to a phenomenon being simulated at a given scale: define the set of
dimensions; specify the list of physical quantities to be computed; and create an
array having the former as its dimensions and the latter as its attributes. Using AQL [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], we
would define the following schema for the 3D model of an artery:
          <preformat>CREATE ARRAY Geometry3D &lt;velocity: point3D, pression: double,
displacement: point3D&gt; [ simulations=0:*,1,0, t=0:500,500,0,
x=1:7000,1000,0, y=1:7000,1000,0, z=1:36000,1000,0]</preformat>
        </p>
        <p>In AQL, we can use the following schema to
represent the 1D model (i.e., a different scale) of the human cardiovascular system:
<preformat>CREATE ARRAY Geometry1D &lt;velocity: double, pression: double,
flow: double&gt; [ simulations=0:*,1,0, t=0:500,500,0 ]</preformat></p>
        <p>A 1D model summarizes each cross-section in a single point. Thus,
each point contains the values of the physical quantities at each time step.
Additionally, the simulations dimension refers to the different simulations over the same mesh. Observe
that the array data model adopted by SciDB enables the direct representation
of the multidimensional objects produced in numerical simulations.</p>
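        <p>The mapping strategy of this section can be sketched as a small helper that assembles a CREATE ARRAY statement from a list of physical quantities and a list of dimensions. The function name create_array_statement is hypothetical, and the sketch assumes that each dimension is written as low:high, chunk length, chunk overlap, which is our reading of the schemas shown in this section.</p>
        <preformat><![CDATA[
```python
def create_array_statement(name, attributes, dimensions):
    """Build an AQL CREATE ARRAY statement as a string.

    attributes: list of (name, type) pairs, one per physical quantity.
    dimensions: list of (name, low, high, chunk_length, chunk_overlap);
                high may be "*" for an unbounded dimension.
    """
    attrs = ", ".join(f"{attr}: {typ}" for attr, typ in attributes)
    dims = ", ".join(
        f"{dim}={low}:{high},{chunk},{overlap}"
        for dim, low, high, chunk, overlap in dimensions
    )
    return f"CREATE ARRAY {name} <{attrs}> [{dims}]"


# Rebuild the 1D schema of this section from its two ingredients.
geometry_1d = create_array_statement(
    "Geometry1D",
    attributes=[("velocity", "double"), ("pression", "double"), ("flow", "double")],
    dimensions=[("simulations", 0, "*", 1, 0), ("t", 0, 500, 500, 0)],
)
print(geometry_1d)
```
]]></preformat>
        <p>The helper only produces text; submitting the statement to SciDB is outside the scope of the sketch.</p>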
      </sec>
      <sec id="sec-3-3">
        <title>Analyzing numerical simulation data</title>
        <p>In order to support the analysis of numerical simulation output, algebraic
operators must be provided. One example is drilling down through scales: given a reference
coordinate in scale s<sub>i</sub>, return the corresponding set of points in scale s<sub>j</sub>. The AQL and
AFL languages do not have sufficient mechanisms to support many such analytical
queries. The challenge is to create new operations and functions to bridge this
gap, as well as to couple them to algebraic operators. An example is to obtain
the highest values achieved for pression:
<preformat>aggregate(Geometry3D, max(pression));</preformat></p>
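        <p>A minimal sketch of the two kinds of query mentioned above, written in plain Python rather than in AQL or AFL. It assumes, purely for illustration, that each 1D cross-section corresponds to a slab of fixed length along the z axis of the 3D model; the function names and sample data are hypothetical.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict


def build_drill_down_index(points_3d, section_length):
    """Group 3D points by the 1D cross-section that summarizes them.

    Assumption: cross-sections are slabs of equal length along the z axis,
    so a 3D point (x, y, z) belongs to section z // section_length.
    """
    index = defaultdict(list)
    for point in points_3d:
        index[point[2] // section_length].append(point)
    return index


def drill_down(index, section):
    """Return the 3D points summarized by one 1D reference coordinate."""
    return list(index.get(section, []))


def argmax_quantity(cells, quantity):
    """Return (coordinate, value) of the cell where a quantity peaks."""
    coord = max(cells, key=lambda c: cells[c][quantity])
    return coord, cells[coord][quantity]


points = [(1, 1, 5), (2, 1, 7), (1, 2, 12), (3, 3, 19), (2, 2, 25)]
index = build_drill_down_index(points, section_length=10)
section_points = drill_down(index, 1)   # 3D points behind 1D section 1

cells = {
    (1, 1, 5): {"pression": 80.0},
    (1, 2, 12): {"pression": 95.5},
    (2, 2, 25): {"pression": 90.1},
}
peak = argmax_quantity(cells, "pression")
```
]]></preformat>
        <p>Unlike a plain maximum aggregate, argmax_quantity also returns the coordinate at which the maximum occurs, which is the kind of information an analytical query over simulation data typically needs.</p>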
      </sec>
      <sec id="sec-3-4">
        <title>Analytical data management service</title>
        <p>From an architectural point of view, we expect to develop a service that interfaces
with the solvers, which produce simulation data, and with scientists, who
submit analytical queries to the system. The Simulation Data Management Service
(SDMS) is responsible for providing such an interface as a cloud service, and
should manage the storage and retrieval of simulation data in a way that is
transparent to scientific applications. The availability of the SDMS as a cloud service
fosters the collaborative use of simulation data among scientists of the same
research project and among different projects.</p>
        <p>
          An important aspect of the cloud service approach is the possibility of
exploiting elasticity [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Indeed, depending on the requested analysis, a huge amount
of data may be retrieved from the data storage device. In such a scenario, the
system may allocate extra memory for processing and free it as soon as the
computation ends. Such a policy is not only a sign of altruism in a collaborative
environment, but may also reduce the costs involved in supporting the computing
platform. Finally, the SDMS should offer scientific applications an API for
accessing its services from traditional programming languages, such as C++ and Java,
and from scripting languages, such as Python.
        </p>
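        <p>The allocate-then-release policy described above can be sketched with a Python context manager. ResourcePool is a toy stand-in invented for this sketch; it does not model Neblina or any real cloud API, and the sizes are arbitrary.</p>
        <preformat><![CDATA[
```python
from contextlib import contextmanager


class ResourcePool:
    """Toy stand-in for a cloud memory pool (hypothetical)."""

    def __init__(self, total_mb):
        self.total_mb = total_mb
        self.allocated_mb = 0

    @contextmanager
    def extra_memory(self, size_mb):
        """Reserve size_mb for one analysis and always release it afterwards."""
        if self.allocated_mb + size_mb > self.total_mb:
            raise MemoryError("not enough free memory in the pool")
        self.allocated_mb += size_mb
        try:
            yield size_mb
        finally:
            # Runs even if the analysis fails, so memory is never leaked.
            self.allocated_mb -= size_mb


pool = ResourcePool(total_mb=4096)
with pool.extra_memory(1024):
    in_use_during_analysis = pool.allocated_mb   # held while the query runs
in_use_after_analysis = pool.allocated_mb        # released once it ends
```
]]></preformat>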
      </sec>
      <sec id="sec-3-5">
        <title>Scientific computing in the cloud</title>
        <p>
          Regarding the SDMS, an important issue is the research and development of
mechanisms that would enable its deployment in a private cloud as Software as a Service
(SaaS) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. In this context, we should highlight the software Neblina, presented
in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Neblina is a middleware developed at LNCC that offers users an
interface to cloud resources. Through Neblina, a cloud infrastructure, including an
application, may be accessed and managed. Typical functionalities include
resource capacity provisioning, user management, an interface for managing virtualized and physical
resources, and remote access to the resources and their monitoring.
        </p>
        <p>The SDMS has been integrated into Neblina. This integration makes the
cloud environment transparent to the SDMS, enabling, for instance, the activation of
its services. Fig. 2 shows the architecture of the integrated environment;
the numbers suggest a retrieval order.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>We investigated the requirements involved in designing a data management service
to support numerical simulation analysis. A multidimensional modeling
approach represents the dimensions used to reference each individual simulation
point, and maps each point to its respective physical quantity values. We observe
that although the multi-array model adopted by SciDB enables the
implementation of the multidimensional representation, further extensions are required
to fully support numerical data representation and analysis requirements. In
particular, some analyses may require physical quantities to be computed over
different abstractions, such as computing their values on a face or edge of a
geometric object in a mesh. Moreover, supporting modeling at different scales
would require a relationship between multiple representations of the same
multidimensional space-time. Some of the proposed analytical queries cannot be
expressed in either of the SciDB query languages. New functions and user
data types would be needed to cope with those. We expect that this work will
provide a better understanding of the needs involved in analytical data
management for multidimensional numerical simulations.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Fernandes</surname>
            ,
            <given-names>F.J.</given-names>
          </string-name>
          , et al.:
          <article-title>Neblina - Espaços virtuais de trabalho para uso em aplicações científicas</article-title>
          .
          <source>In: XXIX SBRC</source>
          . pp.
          <fpage>965</fpage>
          –
          <lpage>972</lpage>
          . Campo Grande, Brazil (Jun
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Laboratório Nacional de Computação Científica: Medicina Assistida por Computação Científica (
          <year>Mar 2012</year>
          ), http://macc.lncc.br/
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Meliou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , et al.:
          <article-title>Causality in databases</article-title>
          .
          <source>Data Eng. Bull</source>
          .
          <volume>33</volume>
          (
          <issue>3</issue>
          ),
          <fpage>59</fpage>
          –
          <lpage>67</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ogasawara</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , et al.:
          <article-title>An algebraic approach for data-centric scientific workflows</article-title>
          .
          <source>In: 37th Intl Conference on VLDB. vol. 4</source>
          , pp.
          <fpage>1328</fpage>
          –
          <lpage>1339</lpage>
          . Seattle, USA (Aug
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Ramakrishnan</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , et al.:
          <article-title>Magellan: experiences from a science cloud</article-title>
          .
          <source>In: 2nd Intl Workshop on Scientific Cloud Computing</source>
          . pp.
          <fpage>49</fpage>
          –
          <lpage>58</lpage>
          . New York, USA (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          SciDB Inc.:
          <source>SciDB User's Guide</source>
          (
          <year>2011</year>
          ), http://www.scidb.org/
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Sowa</surname>
            ,
            <given-names>J.F.</given-names>
          </string-name>
          :
          <article-title>Knowledge Representation: Logical, Philosophical, and Computational Foundations</article-title>
          .
          <source>Course Technology (Aug</source>
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Stonebraker</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , et al.:
          <article-title>Requirements for science data bases and SciDB</article-title>
          .
          <source>In: Conference on Innovative Data Systems Research</source>
          . Asilomar, USA (Jan
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. Zhang, Qi, et al.:
          <article-title>Cloud computing: state-of-the-art and research challenges</article-title>
          .
          <source>Journal of Internet Services and Applications</source>
          <volume>1</volume>
          (
          <issue>1</issue>
          ),
          <fpage>7</fpage>
          –
          <lpage>18</lpage>
          (May
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. Zhang, Yi, et al.:
          <article-title>Storing matrices on disk: Theory and practice revisited</article-title>
          .
          <source>In: 37th Intl Conference on Very Large Data Bases</source>
          . Seattle, USA (Aug
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>