<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ProcessFast, a Java framework for the development of concurrent and distributed applications</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Esuli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tiziano Fagni</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Istituto di Scienza e Tecnologie dell'Informazione Consiglio Nazionale delle Ricerche 56124 Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Today, any application that requires processing information gathered from the Web will likely require a parallel processing approach to be able to scale. While writing such applications, the developer should be able to exploit several types of parallelism paradigms in a natural way. Most of the available development tools are focused on just one of these parallelism types, e.g. the data parallelism, stream processing, etc. In this paper, we introduce ProcessFast, a Java framework for the development of concurrent/distributed applications, designed to allow the developer to integrate both stream/task parallelism and data parallelism in the same application and to seamlessly combine solutions to sub-problems where each solution exploits a speci c programming model.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Most of the frameworks for parallel and distributed computing focus on a single
parallel computing paradigm, e.g., stream processing [
        <xref ref-type="bibr" rid="ref1 ref5 ref6">6, 1, 5</xref>
        ], map-reduce [
        <xref ref-type="bibr" rid="ref3 ref7 ref8">3, 7,
8</xref>
        ], task parallelism [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Complex applications could bene t from using a
combination of di erent approaches. For example, a text mining system that classi es
a stream of tweets can be decomposed into a set of parallel tasks (crawling, NLP
processing, indexing, classi cation, aggregation, report) connected via streams,
with each task possibly exploiting data parallelism to e ciently apply the same
computation to batches of data. Implementing such a system usually requires to
combine several frameworks, with the added burden of implementing a
communication layer among them. Moreover, the target architecture of the system, e.g.,
a single multi-core machine or a distributed environment, will result in di erent
choices of frameworks, since each framework is usually targeted to produce its
maximum e ciency on a speci c architecture. Changing the runtime
architecture will often require to change the underlying parallel computing framework1,
with non trivial implementation cost.
1 Many tools have a standalone mode that simulates a distributed environment on a
single machine, but it is mainly thought as a development tool, not for production
use.
      </p>
      <p>
        We are developing ProcessFast (PF), a Java framework that aims at
providing a seamless integration of di erent parallel computing models, by combining
the functionalities provided by di erent parallel processing frameworks into an
homogeneous API. PF does not aim at implementing yet another parallel
computing framework, its purpose is to act as a higher-level API that abstracts the
functionalities of current parallel computing frameworks, allowing to write
scalable applications that once developed can be deployed, and executed e ciently,
on di erent architectures by only switching the PF runtime layer that
implements the API on the better suited frameworks. This paper introduces the PF
main concepts, structures, and functionalities. The current status of the
development consists of the PF API and a rst implementation of the API on a single
machine architecture mainly based on wrapping the GPars library [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
2
      </p>
      <p>ProcessFast, an overview of the programming model
The PF API2 de nes how the developer can write applications that use and
combine task/stream parallelism and data parallelism. It provides a lock-free
programming model in which an application can be de ned in terms of a set of
asynchronous processes that are able to intercommunicate.
2.1</p>
      <sec id="sec-1-1">
        <title>TaskSet, a logical container and a reusable \black box"</title>
        <p>As shown in Figure 1, the topology of an application is de ned inside a TaskSet.
A TaskSet is a logical container of Tasks (detailed in Section 2.2) and Connectors
(detailed in Section 2.3). A TaskSet is also a Task, and thus can be used as a
\black box" inside another TaskSet. A TaskSet allows to implement task/stream
parallelism through the use of Connectors to let the contained Tasks
communicate together. Barriers can be used to synchronize the execution of Tasks.
2 ProcessFast public repository: https://github.com/tizfa/processfast-api</p>
      </sec>
      <sec id="sec-1-2">
        <title>Task, a stateful and asynchronous logical process</title>
        <p>
          A Task is the minimal unit of execution in PF. Every Task is a stateful logical
process which runs asynchronously with respect to the other tasks in the
application and can be created in multiple instances inside a single program. A Task
is able to communicate through Connectors only with the other Tasks de ned
in the same TaskSet, or externally using the input and output Connectors of the
parent TaskSet (which de ne the entry and exit points of the TaskSet taken as
a black box). As shown in Figure 2, a Task internally can exploit the data
parallelism by operating concurrently on PartitionableDatasets (PDs). The concept
of PD is very similar to that of RDD in Spark [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. A PD is a read-only data
structure which can be split in n partitions, where every partition can then be
processed concurrently as a data stream. PDs can be processed by applying two
types of operations, following a map-reduce model:
{ transformations: each item in the input stream is transformed in some way
resulting in a new item in the output stream (e.g. T1 and T2 in Figure 2).
{ actions: items from input data stream are collected and processed to produce
an aggregated result that is returned to the caller task (e.g. A1 in Figure 2).
2.3
        </p>
      </sec>
      <sec id="sec-1-3">
        <title>Connectors, interprocess communication based on queues</title>
        <p>A Connector is a shared queue, belonging to a TaskSet, uniquely identi ed by
a name. Connectors can have single or multiple Tasks attached as readers and
writers. A Task can consume exclusively an item read from a Connector ( rst
come, rst served, data parallelism) or the Connector can provide to each reading
Task the same complete sequence of items as written to the connector
(broadcasting, task parallelism). Write operations are generally asynchronous (a Task
after posting a message on a speci c connector can continue its computation)
while the read are synchronous and blocking, though is it also possible to de ne
synchronous read/write connections.</p>
      </sec>
      <sec id="sec-1-4">
        <title>Shared permanent storage</title>
        <p>The PF API de nes a shared permanent storage which allows direct access to
some high levels data structures. The main purpose of the permanent storage is
to reduce the amount of information transmitted through connectors and to rely
instead on solutions that are best t for the target architectures. The supported
data structures are:
{ Array : a direct access unidimensional array. The structure supports PD
views, thus it is ready for parallel processing.
{ Matrix : a direct access bidimensional array. The structure supports PD
views, allowing parallel processing by rows or by columns.
{ Dictionary : a key/value collection.</p>
        <p>{ DataStream: a byte stream used to load/store data.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Conclusions</title>
      <p>
        We have introduced the main ideas behind PF, whose grand goals are (i) to
allow developers to seamlessly integrate di erent parallel computing models into
their applications, and (ii) to implement a write once, run (e ciently) anywhere
model for parallel/distributed applications. We have recently completed the rst
implementation of the ProcessFast API based on Groovy GPars library [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. One
of our rst tests, on a eight cores CPU, obtained a ve-fold improvement against
a sequential implementation of matrix multiplication using matrixes with a size
of 10'000 by 10'000. Future development will focus on implementing the API on
a distributed architecture and on running comparative tests with the well-known
alternatives (e.g., Hadoop, Spark, Storm).
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Apache</given-names>
            <surname>Samza</surname>
          </string-name>
          . http://samza.apache.org/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Gpars:
          <article-title>Groovy parallel system</article-title>
          . http://gpars.codehaus.org/.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <article-title>Je rey Dean and Sanjay Ghemawat</article-title>
          .
          <source>Mapreduce: Simpli ed data processing on large clusters. Commun. ACM</source>
          ,
          <volume>51</volume>
          (
          <issue>1</issue>
          ):
          <volume>107</volume>
          {
          <fpage>113</fpage>
          ,
          <year>January 2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>J.</given-names>
            <surname>Dinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Krishnamoorthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.B.</given-names>
            <surname>Larkins</surname>
          </string-name>
          , Jarek Nieplocha, and
          <string-name>
            <given-names>P.</given-names>
            <surname>Sadayappan</surname>
          </string-name>
          .
          <article-title>Scioto: A framework for global-view task parallelism</article-title>
          .
          <source>In Parallel Processing</source>
          ,
          <year>2008</year>
          . ICPP '
          <volume>08</volume>
          . 37th International Conference on, pages
          <volume>586</volume>
          {
          <fpage>593</fpage>
          ,
          <string-name>
            <surname>Sept</surname>
          </string-name>
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Gianmarco De Francisci Morales</surname>
            and
            <given-names>Albert</given-names>
          </string-name>
          <string-name>
            <surname>Bifet</surname>
          </string-name>
          .
          <article-title>Samoa: Scalable advanced massive online analysis</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>16</volume>
          :
          <fpage>149</fpage>
          {
          <fpage>153</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Ankit</given-names>
            <surname>Toshniwal</surname>
          </string-name>
          , Siddarth Taneja, Amit Shukla, Karthik Ramasamy,
          <string-name>
            <surname>Jignesh M. Patel</surname>
          </string-name>
          , Sanjeev Kulkarni,
          <string-name>
            <surname>Jason Jackson</surname>
            ,
            <given-names>Krishna</given-names>
          </string-name>
          <string-name>
            <surname>Gade</surname>
            , Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, and
            <given-names>Dmitriy</given-names>
          </string-name>
          <string-name>
            <surname>Ryaboy</surname>
          </string-name>
          .
          <article-title>Storm@twitter</article-title>
          .
          <source>In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14</source>
          , pages
          <fpage>147</fpage>
          {
          <fpage>156</fpage>
          , New York, NY, USA,
          <year>2014</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Tom</given-names>
            <surname>White</surname>
          </string-name>
          .
          <article-title>Hadoop: the de nitive guide: the de nitive guide. "</article-title>
          <string-name>
            <surname>O'Reilly Media</surname>
          </string-name>
          ,
          <source>Inc."</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Matei</given-names>
            <surname>Zaharia</surname>
          </string-name>
          , Mosharaf Chowdhury,
          <string-name>
            <surname>Michael J Franklin</surname>
            ,
            <given-names>Scott</given-names>
          </string-name>
          <string-name>
            <surname>Shenker</surname>
            , and
            <given-names>Ion</given-names>
          </string-name>
          <string-name>
            <surname>Stoica</surname>
          </string-name>
          .
          <article-title>Spark: cluster computing with working sets</article-title>
          .
          <source>In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing</source>
          , pages
          <volume>10</volume>
          {
          <fpage>10</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>