<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">ProcessFast, a Java framework for the development of concurrent and distributed applications</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Andrea</forename><surname>Esuli</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Istituto di Scienza e Tecnologie dell&apos;Informazione Consiglio Nazionale delle Ricerche</orgName>
								<address>
									<postCode>56124</postCode>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Tiziano</forename><surname>Fagni</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Istituto di Scienza e Tecnologie dell&apos;Informazione Consiglio Nazionale delle Ricerche</orgName>
								<address>
									<postCode>56124</postCode>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">ProcessFast, a Java framework for the development of concurrent and distributed applications</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">A3541A32A9A5675C9B8408791FFA35D1</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T20:45+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Today, any application that processes information gathered from the Web will likely need parallel processing in order to scale. While writing such applications, the developer should be able to exploit several parallelism paradigms in a natural way. Most of the available development tools focus on just one type of parallelism, e.g., data parallelism or stream processing. In this paper, we introduce ProcessFast, a Java framework for the development of concurrent/distributed applications, designed to allow the developer to integrate both stream/task parallelism and data parallelism in the same application and to seamlessly combine solutions to sub-problems where each solution exploits a specific programming model.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Most of the frameworks for parallel and distributed computing focus on a single parallel computing paradigm, e.g., stream processing <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b0">1,</ref><ref type="bibr" target="#b4">5]</ref>, map-reduce <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b7">8]</ref>, or task parallelism <ref type="bibr" target="#b3">[4]</ref>. Complex applications could benefit from using a combination of different approaches. For example, a text mining system that classifies a stream of tweets can be decomposed into a set of parallel tasks (crawling, NLP processing, indexing, classification, aggregation, reporting) connected via streams, with each task possibly exploiting data parallelism to efficiently apply the same computation to batches of data. Implementing such a system usually requires combining several frameworks, with the added burden of implementing a communication layer among them. Moreover, the target architecture of the system, e.g., a single multi-core machine or a distributed environment, will result in different choices of frameworks, since each framework is usually designed to reach its maximum efficiency on a specific architecture. Changing the runtime architecture will often require changing the underlying parallel computing framework<ref type="foot" target="#foot_0">1</ref>, with non-trivial implementation cost. We are developing ProcessFast (PF), a Java framework that aims at providing a seamless integration of different parallel computing models, by combining the functionalities provided by different parallel processing frameworks into a homogeneous API. 
PF does not aim at implementing yet another parallel computing framework; its purpose is to act as a higher-level API that abstracts the functionalities of current parallel computing frameworks, allowing developers to write scalable applications that, once developed, can be deployed and executed efficiently on different architectures by only switching the PF runtime layer that implements the API on the best-suited frameworks. This paper introduces the main concepts, structures, and functionalities of PF. The current status of the development consists of the PF API and a first implementation of the API on a single-machine architecture, mainly based on wrapping the GPars library <ref type="bibr" target="#b1">[2]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">ProcessFast, an overview of the programming model</head><p>The PF API<ref type="foot" target="#foot_1">2</ref> defines how the developer can write applications that use and combine task/stream parallelism and data parallelism. It provides a lock-free programming model in which an application can be defined in terms of a set of asynchronous processes that are able to intercommunicate.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">TaskSet, a logical container and a reusable "black box"</head><p>As shown in Figure <ref type="figure" target="#fig_0">1</ref>, the topology of an application is defined inside a TaskSet. A TaskSet is a logical container of Tasks (detailed in Section 2.2) and Connectors (detailed in Section 2.3). A TaskSet is also a Task, and thus can be used as a "black box" inside another TaskSet. A TaskSet implements task/stream parallelism by using Connectors to let the Tasks it contains communicate with each other. Barriers can be used to synchronize the execution of Tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Task, a stateful and asynchronous logical process</head><p>A Task is the minimal unit of execution in PF. Every Task is a stateful logical process that runs asynchronously with respect to the other Tasks in the application and can be created in multiple instances inside a single program. A Task can communicate through Connectors only with the other Tasks defined in the same TaskSet, or externally using the input and output Connectors of the parent TaskSet (which define the entry and exit points of the TaskSet taken as a black box). As shown in Figure <ref type="figure" target="#fig_1">2</ref>, a Task can internally exploit data parallelism by operating concurrently on PartitionableDatasets (PDs). The concept of PD is very similar to that of RDD in Spark <ref type="bibr" target="#b7">[8]</ref>. A PD is a read-only data structure that can be split into n partitions, where every partition can then be processed concurrently as a data stream. PDs can be processed by applying two types of operations, following a map-reduce model:</p><p>- transformations: each item in the input stream is transformed into a new item in the output stream (e.g., T1 and T2 in Figure <ref type="figure" target="#fig_1">2</ref>).</p><p>- actions: items from the input data stream are collected and processed to produce an aggregated result that is returned to the caller Task (e.g., A1 in Figure <ref type="figure" target="#fig_1">2</ref>).</p></div>
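<div xmlns="http://www.tei-c.org/ns/1.0"><p>The split between transformations and actions on a partitioned, read-only dataset can be illustrated with a minimal sketch built only on the standard Java streams library. The class PdSketch and its methods of, map, reduce, and partitionCount are hypothetical names chosen for this illustration and are not the ProcessFast API; unlike a real PD, this sketch evaluates transformations eagerly, and reduce assumes an associative operator with a true identity value.</p>
<p><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical stand-in for a PartitionableDataset; NOT the ProcessFast API.
final class PdSketch<T> {
    private final List<List<T>> partitions; // never modified after construction

    private PdSketch(List<List<T>> partitions) {
        this.partitions = partitions;
    }

    // Split an immutable copy of the items into at most n contiguous partitions.
    static <T> PdSketch<T> of(List<T> items, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<T> data = List.copyOf(items);
        int size = Math.max(1, (data.size() + n - 1) / n);
        List<List<T>> parts = new ArrayList<>();
        for (int start = 0; start < data.size(); start += size) {
            parts.add(data.subList(start, Math.min(start + size, data.size())));
        }
        return new PdSketch<>(parts);
    }

    // Transformation: every input item yields one output item.
    // Partitions are processed concurrently; items inside a partition sequentially.
    <R> PdSketch<R> map(Function<T, R> f) {
        List<List<R>> out = partitions.parallelStream()
                .map(p -> p.stream().map(f).collect(Collectors.toList()))
                .collect(Collectors.toList());
        return new PdSketch<>(out);
    }

    // Action: aggregate all items into a single result returned to the caller.
    T reduce(T identity, BinaryOperator<T> op) {
        return partitions.parallelStream()
                .map(p -> p.stream().reduce(identity, op))
                .reduce(identity, op);
    }

    int partitionCount() {
        return partitions.size();
    }

    public static void main(String[] args) {
        PdSketch<Integer> pd = PdSketch.of(List.of(1, 2, 3, 4, 5, 6, 7, 8), 3);
        // Sum of squares of 1..8: prints 204
        System.out.println(pd.map(x -> x * x).reduce(0, Integer::sum));
    }
}
```
]]></p>
<p>In this sketch, map plays the role of T1 and T2 and reduce plays the role of A1: the first produces a new partitioned dataset, while the second returns a plain value to the calling code.</p></div>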
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Connectors, interprocess communication based on queues</head><p>A Connector is a shared queue, belonging to a TaskSet and uniquely identified by a name. Connectors can have one or more Tasks attached as readers and writers. Either each item is consumed exclusively by one reading Task (first come, first served; data parallelism), or the Connector provides each reading Task with the same complete sequence of items written to it (broadcasting; task parallelism). Write operations are generally asynchronous (after posting a message on a Connector, a Task can continue its computation), while reads are synchronous and blocking, though it is also possible to define synchronous read/write connections.</p></div>
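<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two read semantics described above can be sketched with the blocking queues of java.util.concurrent. The classes ConnectorSketch, SharedQueue, and Broadcast are hypothetical names used only for illustration and do not reproduce the ProcessFast Connector API; the sketch also assumes that all readers are attached before the first write, and it uses plain synchronized blocks for brevity.</p>
<p><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of the two read semantics; NOT the ProcessFast Connector API.
final class ConnectorSketch {

    // Blocking read that turns interruption into an unchecked exception.
    private static <T> T take(BlockingQueue<T> q) {
        try {
            return q.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while reading", e);
        }
    }

    // First come, first served: one shared queue, each item goes to exactly one reader.
    static final class SharedQueue<T> {
        private final BlockingQueue<T> queue = new LinkedBlockingQueue<>();

        void write(T item) { queue.add(item); }  // asynchronous: returns immediately

        T read() { return take(queue); }         // synchronous: blocks until an item exists
    }

    // Broadcasting: one private queue per reader, every write is copied to all of them.
    static final class Broadcast<T> {
        private final List<BlockingQueue<T>> readers = new ArrayList<>();

        // Registers a reader and returns its id (assumed to happen before any write).
        synchronized int attachReader() {
            readers.add(new LinkedBlockingQueue<>());
            return readers.size() - 1;
        }

        synchronized void write(T item) {
            for (BlockingQueue<T> q : readers) {
                q.add(item);
            }
        }

        T read(int readerId) {
            BlockingQueue<T> q;
            synchronized (this) {
                q = readers.get(readerId);
            }
            return take(q); // block outside the lock so writers are not stalled
        }
    }

    public static void main(String[] args) {
        Broadcast<String> b = new Broadcast<>();
        int r0 = b.attachReader();
        int r1 = b.attachReader();
        b.write("x");
        System.out.println(b.read(r0) + " " + b.read(r1)); // prints "x x"
    }
}
```
]]></p>
<p>A synchronous read/write connection, where the writer also waits for a reader, could be approximated in the same style with java.util.concurrent.SynchronousQueue; that choice is an assumption of this sketch, not something stated about the PF runtime.</p></div>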
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">Shared permanent storage</head><p>The PF API defines a shared permanent storage that allows direct access to some high-level data structures. The main purpose of the permanent storage is to reduce the amount of information transmitted through Connectors and to rely instead on solutions that best fit the target architecture. The supported data structures are:</p><p>- Array: a direct-access one-dimensional array. The structure supports PD views, thus it is ready for parallel processing.</p><p>- Matrix: a direct-access two-dimensional array. The structure supports PD views, allowing parallel processing by rows or by columns.</p><p>- Dictionary: a key/value collection.</p><p>- DataStream: a byte stream used to load/store data.</p></div>
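<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of what processing a matrix by rows can look like, the following minimal sketch multiplies two matrices with one independent unit of work per output row, using a parallel IntStream from the standard library. The class RowParallelMatrix is a hypothetical name, plain double arrays stand in for the PF Matrix structure, and the code is not the implementation used by the authors.</p>
<p><![CDATA[
```java
import java.util.stream.IntStream;

// Hypothetical row-parallel matrix product; plain arrays stand in for a PF Matrix.
final class RowParallelMatrix {

    // Multiplies a (n x k) by b (k x m). Each output row is computed by one
    // independent unit of work, so no two workers write to the same row.
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;
        int k = b.length;
        int m = b[0].length;
        double[][] c = new double[n][m];
        IntStream.range(0, n).parallel().forEach(i -> {
            for (int p = 0; p < k; p++) {
                double aip = a[i][p];
                for (int j = 0; j < m; j++) {
                    c[i][j] += aip * b[p][j];
                }
            }
        });
        return c;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        double[][] c = multiply(a, b);
        System.out.println(c[0][0] + " " + c[1][1]); // prints "19.0 50.0"
    }
}
```
]]></p>
<p>Because the inputs are only read and each row of the result has a single writer, the workers need no locking, which matches the read-only, partition-wise processing style described for PD views.</p></div>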
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Conclusions</head><p>We have introduced the main ideas behind PF, whose grand goals are (i) to allow developers to seamlessly integrate different parallel computing models into their applications, and (ii) to implement a write once, run (efficiently) anywhere model for parallel/distributed applications. We have recently completed the first implementation of the ProcessFast API, based on the Groovy GPars library <ref type="bibr" target="#b1">[2]</ref>. One of our first tests, on an eight-core CPU, obtained a five-fold improvement over a sequential implementation of matrix multiplication using matrices with a size of 10,000 by 10,000. Future development will focus on implementing the API on a distributed architecture and on running comparative tests with the well-known alternatives (e.g., Hadoop, Spark, Storm).</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1.</head><label>1</label><figDesc>Fig. 1. Schema of a ProcessFast application</figDesc><graphic coords="2,134.77,115.83,345.83,155.85" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 2.</head><label>2</label><figDesc>Fig. 2. The internal structure of a Task</figDesc><graphic coords="3,134.77,115.83,345.83,154.77" type="bitmap" /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">Many tools have a standalone mode that simulates a distributed environment on a single machine, but it is mainly intended as a development tool, not for production use.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">ProcessFast public repository: https://github.com/tizfa/processfast-api</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<ptr target="http://samza.apache.org/" />
		<title level="m">Apache Samza</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<ptr target="http://gpars.codehaus.org/" />
		<title level="m">Gpars: Groovy parallel system</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Mapreduce: Simplified data processing on large clusters</title>
		<author>
			<persName><forename type="first">Jeffrey</forename><surname>Dean</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sanjay</forename><surname>Ghemawat</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Commun. ACM</title>
		<imprint>
			<biblScope unit="volume">51</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="107" to="113" />
			<date type="published" when="2008-01">January 2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Scioto: A framework for global-view task parallelism</title>
		<author>
			<persName><forename type="first">J</forename><surname>Dinan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Krishnamoorthy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">B</forename><surname>Larkins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jarek</forename><surname>Nieplocha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Sadayappan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">37th International Conference on Parallel Processing (ICPP &apos;08)</title>
				<imprint>
			<date type="published" when="2008-09">Sept 2008</date>
			<biblScope unit="page" from="586" to="593" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Samoa: Scalable advanced massive online analysis</title>
		<author>
			<persName><forename type="first">Gianmarco</forename><surname>De Francisci Morales</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Albert</forename><surname>Bifet</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="page" from="149" to="153" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Storm@twitter</title>
		<author>
			<persName><forename type="first">Ankit</forename><surname>Toshniwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Siddarth</forename><surname>Taneja</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Amit</forename><surname>Shukla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Karthik</forename><surname>Ramasamy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jignesh</forename><forename type="middle">M</forename><surname>Patel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sanjeev</forename><surname>Kulkarni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jason</forename><surname>Jackson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Krishna</forename><surname>Gade</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Maosong</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jake</forename><surname>Donham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nikunj</forename><surname>Bhagat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sailesh</forename><surname>Mittal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dmitriy</forename><surname>Ryaboy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD &apos;14</title>
				<meeting>the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD &apos;14<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="147" to="156" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">Tom</forename><surname>White</surname></persName>
		</author>
		<title level="m">Hadoop: the definitive guide</title>
				<imprint>
			<publisher>O&apos;Reilly Media, Inc</publisher>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Spark: cluster computing with working sets</title>
		<author>
			<persName><forename type="first">Matei</forename><surname>Zaharia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mosharaf</forename><surname>Chowdhury</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Michael</forename><forename type="middle">J</forename><surname>Franklin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Scott</forename><surname>Shenker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ion</forename><surname>Stoica</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2nd USENIX conference on Hot topics in cloud computing</title>
				<meeting>the 2nd USENIX conference on Hot topics in cloud computing</meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="10" to="10" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
