=Paper=
{{Paper
|id=Vol-2400/paper-29
|storemode=property
|title=Using CalcuList To MapReduce Json Documents
|pdfUrl=https://ceur-ws.org/Vol-2400/paper-29.pdf
|volume=Vol-2400
|authors=Domenico Saccà,Angelo Furfaro
|dblpUrl=https://dblp.org/rec/conf/sebd/SaccaF19
}}
==Using CalcuList To MapReduce Json Documents==
<pdf width="1500px">https://ceur-ws.org/Vol-2400/paper-29.pdf</pdf>
<pre>
Using CalcuList To MapReduce Json Documents
                              (DISCUSSION PAPER)


                        Domenico Saccà and Angelo Furfaro

              DIMES Dept, Università della Calabria, 87036 Rende, Italy
                    sacca@unical.it, a.furfaro@unical.it


        Abstract. CalcuList (Calculator with List manipulation), is an edu-
        cational language for teaching functional programming extended with
        some imperative and side-effect features, which are enabled under ex-
        plicit request by the programmer. As the language natively supports
        json objects, it may be effectively used to implement generic MapRe-
        duce, which is a popular model in distributed computing that underpins
        many NoSQL systems. As a list a jsons can be thought of as a dataset of
        a document NoSQL datastore, it turns out that CalcuList can be used as
        a tool for teaching advanced query algorithms for document datastores
        such as MongoDB and CouchDB.


Keywords: MapReduce, Document Databases, Query Languages, Functional
Language, Imperative Features


1     Introduction
Pure functional programming designates a functional programming paradigm
that treats all computation as the evaluation of mathematical functions. This
paradigm forbids changing-state and mutable data so that functions will only
depend on their arguments, regardless of any global or local state (i.e., it has no
side effects).
    An important member of the purely functional languages is Haskell [5, 4],
which provides relevant features including polymorphic typing, static type check-
ing, lazy evaluation and higher-order functions. Other functional languages are
less pure (impure), as they contain imperative features. For instance, most of the
languages of the Lisp Family [14] were designed to be multi-paradigm, mainly
because the use of side-effects, in some cases, simplifies the developed code or
even results in a better computational efficiency.
    CalcuList (Calculator with List manipulation) is an educational program-
ming language for teaching the functional paradigm, suitably extended with
    Copyright ©    2019 for the individual papers by the papers authors. Copying permit-
    ted for private and academic purposes. This volume is published and copyrighted by
    its editors. SEBD 2019, June 16-19, 2019, Castiglione della Pescaia, Italy.
    The full version of this paper is published in [11] and can be downloaded as an open
    access publication from https://dl.acm.org/citation.cfm?doid=3216122.3216164
some imperative features involving side effects [10]. Thus, CalcuList essentially
belongs to the family of “impure” functional languages but it may be use as a
pure functional language, in that side effects are required to be explicitly enabled.
    CalcuList syntax is rather similar to that of Python [13] and natively supports
processing of strings, lists and json objects. Json (JavaScript Object Notation) [3]
is language-independent data interchange format, which has been originally in-
troduced as a subset of the JavaScript scripting language and is now widespread
in many applications, e.g. web-services, and support for it has been added to the
standard libraries of many programming languages.
    The language is strongly typed, but type checking is mainly dynamic as most
of the type checking is done at run-time. An interactive computation session
with the user is established by means of a REPL (Read-Evaluate-Print-Loop)
shell. During a session, a user may define a number of global variables and
functions and run suitable computations on them as queries, that are computed
and displayed on the fly. CalcuList expressions and functions are first compiled
and then executed each time a query is issued.
    In this paper we shall illustrate powerful features of CalcuList to support
MapReduce, which is a programming model for processing and generating large
data sets, inspired by the functions map and reduce commonly used in func-
tional programming and introduced by Google to support parallel computations
on large data sets spread over clusters of computers. Nowadays MapReduce is
considered a popular “Data-Oriented” model in distributed computing that pro-
cesses data simultaneous data sources in two primary steps: Map and Reduce.
The first step maps the input data into intermediate data sets and the second
step aggregates the intermediate results in a consolidated result.
    MapReduce is mainly used to implement applications for a document store
(also called document database or document-oriented database), which is a sub-
set of a type of NoSQL database systems that have been released and widely
adopted in many domains poorly served by relational databases [7]. A document
store is used for storing, retrieving, and managing semi-structured data, mainly
organized as jsons. In this paper we show how to use CalcuList to implement
MapReduce programs on jsons and we argue that suitable interfaces could make
CalcuList a powerful query environment for popular document stores such as
CouchDB [12] and MongoDB [1].
     The remainder of this paper is organized as follows. Section 2 introduces
basic notions about the CalcuList functional core, illustrates higher-order func-
tions and their usage to define MapReduce queries, and describes the imperative
aspects of the language and their usage for some situations where they may sim-
plify the definition of MapReduce code or even result in a better computational
efficiency. Section 3 focus on the manipulation of json structures in CalcuList
and on the usage of higher-order functions to implement MapReduce queries on
a document datastore represented bya list of jsons. Finally, Section 4 draws the
conclusions and discusses future work.
2    An Overview of CalcuList
The main interface to CalcuList, like other languages (e.g. Python, Scala), is a
REPL environment where the user can define functions and issue valid expres-
sions (queries in CalcuList), which in turn are parsed, evaluated and printed
before the control is given back to the user.
    The basic types for CalcuList are six: (1) double, (2) int , (3) char, (4) bool
(with values true or false), (5) null (that has a unique value named null as well)
and (6) type (whose values are all the other types: double, int, etc.).
    Three compound types are supported: string, list and json. A string in Cal-
cuList is an immutable sequence (possibly empty) of characters – two strings
can be concatenated by the overloaded operator + returning a new string and
slice operators à la Python are available to extract a substring. Jsons will be
described in Section 3. Next we provide some insights on lists.
    The list is a powerful compound data type used by CalcuList for constructing
dynamic data structures and implementing recursive algorithms. A list L con-
sists of a number (possibly zero) of comma-separated elements between square
brackets. As in Python, the elements can be of any type, including list and json,
and heterogeneous. They are numbered with an index starting from zero: the
first element (called the head of the list) is L[0] (also denoted simply by L[.]),
the second element is L[1] and so on.
    A list can be extended by adding additional comma-separated elements on
top, followed by the append operator (the bar “|”) - the syntax has been inspired
by Prolog [2]. For instance, given a list L, M=[x,y|L] extends L by adding two
new elements, x and y, on top of L. Another operator on a list L is L[>], which
returns the (possibly empty) tail of L, i.e., the list starting from the element
L[1]. Slice operators on a list L (say with n elements) are deep in the sense
that they clone the elements. In particular, L[:] clones the whole list L whereas
L[i:], L[i1 :i2 ] and L[:i] clone suitable sublists.
    The syntax for an expression is like in Java. An expression e is either simple
or conditional with format e1 ? e2 : e3 , where e1 is a logical (i.e., with type bool)
expression and e2 and e3 are expressions (possibly conditional in their turn).
The meaning of a conditional expression is: its value is equal to the value of e2
if e1 evaluates to true or, otherwise, to the value of e3 .
    The language allows the user to define global variables. A global variable V
is defined as “V = e”, where e is an expression: the type of V is inferred from
the type of e and its value is kept until a later re-definition, which may change
also the type. A query is an expression e prefixed by the ^ symbol and prints
the value of the expression at the current state.
    A function is defined by giving it a name, followed by: its comma-separated
parameter names (included in parentheses), the colon symbol “:” and the (typi-
cally conditional) expression that computes the value of the function. Parameters
as well as the return value of a function are defined without specifying types for
them, so that static type checking is performed only for constant operands. A
function is compiled while it is defined and, as for Python, the type checking is
done at run time, after its call, when all the types become available.
   No side effects are allowed in the basic definitions of functions, i.e., functional
operations do not modify current global variables and always create new data
objects. Differently from Python, global variables are not accessible inside a
function so that a possible change of state does not affect the function behavior.

    >> member (x , L ): L !=[] && ( x == L [.] || member (x , L [ >]));
    >> sumL ( L ) : L ==[]? 0: L [.]+ sumL ( L [ >]);
    >> range ( x1 , x2 ): x1 > x2 ? []: [ x1 | range ( x1 +1 , x2 )];
    >> R16 = range (1 ,16); /* global variable */
    > >^ sumL ( R16 ); /* compute the sum of the elements in R16 */
    136


                             Fig. 1. Functions on Lists

    Figure 1 presents a work session with CalcuList that consists of a number of
basic functions for handling lists. The first one is the classical member function
that checks wether an element belongs to a list. The sumL function sums the
elements of a list and the range function generates the lists of the integers in a
given range. Next, a list R16 with elements in the range from 1 to 16 is defined
and the query to compute the sum of all such elements is issued.
2.1. Higher-Order Functions and MapReduce Definitions
     CalcuList supports higher-order functions since a function parameter can
also be a function and a function may return a function. A function parameter
f is written as f/n, where n is the arity of the function f. In addition, adding
/n after a function head g(...) prescribes that the function g must return a
function with arity n.
     Some examples of higher-order functions implementing MapReduce compu-
tations are presented in Figure 2. The function map receives a list L and two
functions as parameters, both with arity 1: f checks whether an element has a
certain property and, in case the test succeeds, m maps the element into another
value. The function map scans all elements of L, for each of them it performs the
possible mapping and eventually returns the list of the results. The first map
query filters all numbers divisible by 2 or by 3 (see function d2or3) in the range
from 1 to 10 and replaces each of these values with their cube (see the lambda
function) – let M denote the result of this query. Note that lambda functions
are written in Python like syntax.
     The subsequent function reduce receives a list L, a function parameter f
(mapping two values into another value) and an initial value that is returned
at final stage of the recursion, when L reduces to an empty list. Then reduce
applies f to two actual parameters: the list head and the result of reduce for the
list tail. Therefore, the list is eventually reduced to a single value by recursively
applying f from right to left. The first reduce query computes the sum of all
elements in M (the list returned by map), whereas the second one computes the
product of the squares (see second lambda function parameter) of all integers in
the range from 1 to 10 that are divisible both by 2 and by 3.
    >> map (L , f /1 , m /1) : L ==[]?[]: f ( L [.])?
         [ m ( L [.])| map ( L [ >] ,f , m )]: map ( L [ >] ,f , m );
    >> d2or3 ( x ) : x %2==0 || x %3==0;
    > >^ map ( R16 , d2or3 , lambda x : x * x * x );
    [ 8 , 27 , 64 , 216 , 512 , 729 , 1000 , 1728 , 2744 , 3375 , 4096 ]
    >> reduce (L , f /2 , in ): L ==[]? in : f ( L [.] , reduce ( L [ >] ,f , in ));
    >> sum (x , y ): x + y ;
    > >^ reduce ( map ( range (1 ,10) , d2or3 , lambda x : x * x * x ) , sum ,0);
    14499
    > >^ reduce ( map ( range (1 ,10) , lambda x : x %2==0&& x %3==0 ,
         lambda x : x * x ) , lambda x , y : x *y ,1);
    5184


             Fig. 2. High-Order Functions and MapReduce Definitions

2.2. Imperative Features of CalcuList
     CalcuList has a pure functional programming core but it also allows the
programmer to define a function with side effects by adding * next to its name
(star function). In this way, the usage of global variables inside a function is
enabled (provided that they are explicitly listed in its definition body) and they
may have assigned new values during the function execution, as in an imperative
programming language. Actually also parameter values can be modified during
the execution of a star function.
     Global variables and parameters can be updated inside a function by means
of the so-called global setting commands (GSC). A GSC is an assignment of
an expression value to a labeled global variable or to a function parameter and
represents an imperative statement with side effects. GSCs can be inserted both
before and after the expression defining the star function. In case of a conditional
expression, GSCs may be also inserted before and after each sub-expression.
The usage of GSCs has been inspired by the semantic rules of an Attribute
Grammar [6].
     An example of star function is shown in Figure 3. The function is reduceCount
that takes a list L of possibly duplicated elements and returns a list LC of pairs
[e, n], where e is an element in L and n is the number of occurrences of e in L. The
subsequent query returns the list of characters and their numbers of occurrences
in the string "Hello_World". Note that "Hello_World"@list transforms the
string in a list of characters.


3   Definitions of Jsons and Manipulation of Json Lists

CalcuList natively supports json objects (referred to simply as json), which
are at runtime represented as (possibly empty) sequences of fields separated by
comma and enclosed into curly braces. A field is a pair (key, value) separated
by a colon: key is a string and value can be of any type, including json.
    >> rc1 *( _ , _ ) : null ;
    >> rc2 *( L ,M , C ) : M ==[]? rc1 ( L [ >] ,[[ L [.] ,1]| C ]):
       L [.]== M [.][0]?{! M [0][1]+=1!} rc1 ( L [ >] , C ): rc2 (L , M [ >] , C );
    >> rc1 *( L , M ) : L ==[]? M : rc2 (L ,M , M );
    >> reduceCount *( L ) : rc1 (L ,[]);
    > >^ reduceCount (" Hello_World " @list );
    [ [ ’H ’ , 1 ] , [ ’e ’ , 1 ] , [ ’l ’ , 3 ] , [ ’o ’ , 2 ] ,
    > >^ reduceCount ( map ( R16 , lambda x : true ,
       lambda x : x %2==0? " even ": " odd "));
    [ [ " even " , 8 ] , [ " odd " , 8 ] ]


                               Fig. 3. Star Functions


    >> emps = [ { " name ": " e1 " , " age ": 30 } ,
      {" name ":" e2 " ," age ":32 ," proj ":[" p1 " ," p2 "] ," bonus ":10} ,
      {" name ":" e3 " , " age ":28 , " proj ":[" p1 " ," p3 "] } ];
    > >^ emps [2][" proj "];
    [ " p1 " , " p3 " ]
    > >^ emps [0][" proj "] %*;
    null


               Fig. 4. Json Definition and Manipulation in CalcuList


    In Figure 4 we define a variable emps as a list of three jsons, each representing
an employee. For instance, emps[1] is the employee with name "e2", aged 32
and working for the projects "p1" and "p2". Given a json J and a key k, J[k]
denotes the value of the field of J with key equal to k – if there is no such a field
then the value null is returned as it happens for ^emps[0]["projects"] (the
string %* next to the query is a display option to force the writing of the null
value). Given a value v, J[k] = v modifies the value for the field if J includes
the key k or otherwise, J is extended with a new field with key k and value v.
    Figure 5 presents a high-order function to filter the jsons that satisfy some
conditions in one of the fields: jsFilter, whose filtering condition parameter is
expressed by a function filterC, that receives the current json and a field, say
with key K and value V , to be used for the evaluation. Next, the figure shows
two implementations for filterC: (1) selKV, which checks whether the json has
a field equal to (K, V ), and (2) selKinV, which checks whether the value V is
included in the list of elements that represents the value for the the field K in
the json. Two queries on the variable emps defined in Figure 4 are issued: the
first one filter all employees with age 28 and the second one the employees that
work for at least the project "p1".
    Figure 5 also presents a classical map function that scans all employees to
produce the list of all projects for which they work – note that, by adding
the option [:] to E[.]["proj"], the project lists are cloned to avoid their
    >> jsFilter ( LJ , filtC /3 ,K , V ): LJ ==[]?[]: filtC ( LJ [.] , K , V )?
          [ LJ [.]| jsFilter ( LJ [ >] , filtC ,K , V )]:
          jsFilter ( LJ [ >] , filtC ,K , V );
    >> selVeqK (J ,K , V ): J [ K ]!= null && J [ K ]== V ;
    > >^ jsFilter ( emps , selVeqK ," age " ,28);
    [ {" name ":" e3 " , " age ":28 , " proj ":[" p1 " ," p3 "] } ]
    >> selVinK (J ,K , V ) : J [ K ]!= null && member (V , J [ K ]);
    > >^ jsFilter ( emps , selVinK ," proj " ," p1 ") ;
    [{" name ":" e2 " ," age ":32 ," proj ":[" p1 " ," p2 "] ," bonus ":10} ,
      {" name ":" e3 " ," age ":28 ," proj ":[" p1 " ," p3 "] } ]
    >> mapProj ( E ) : E ==[]? []: E [.][" proj "] != null ?
          E [.][" proj "][:]+ mapProj ( E [ >]): mapProj ( E [ >]);
    > >^ reduceCount ( mapProj ( emps ));
    [ [ " p1 " , 2 ] , [ " p2 " , 1 ] , [ " p3 " , 1 ] ]


                      Fig. 5. Higher Order Functions on Jsons


coalescence. The query calling the function reduceCount, defined in Figure 3,
reduces the list by storing each project once as a pair: project name and number
of its occurrences (i.e., the total number of employees working for it).
    Figure 6 includes a star function that updates a list of jsons. As a list of jsons
can be though of as a collection in a documentary database, star functions have
the crucial role of updating the database. As an example we have written the
function giveBonus that assigns a bonus of a given amount to all employees in a
list – the amount is incremented if an employee already has a bonus. Obviously
the function has side effects as it updates the elements in the parameter emps.
We present a query that assigns a bonus of 100 Euro to all employees in the
global variable emps (defined in Figure 4) who work in project p1.


    >> aBonus *( E , v ): E ==[]? true : E [.][" bonus "]== null ?
          {! E [0][" bonus "]= v !} giveBonus ( E [ >] , v ):
          {! E [0][" bonus "]+= v !} giveBonus ( E [ >] , v );
    > >^ aBonus ( jsFilter ( emps , selVinK ," proj " ," p1 ") ,100);
    true
    > >^ emps ;
    [{" name ": " e1 " , " age ": 30} ,
      {" name ":" e2 " , " age ":32 , " proj ":[" p1 " ," p2 "] , " bonus ":110} ,
      {" name ":" e3 " , " age ":28 , " proj ":[" p1 " ," p3 "] , " bonus ":100}
    ]


                    Fig. 6. Functions with Side Effects on Jsons
4   Conclusion

In this paper we have presented a new educational functional programming lan-
guage extended with imperative programming features: CalcuList, whose imper-
ative features are enabled under explicit request by the programmer. CalcuList
expressions and functions are first compiled and then executed each time a query
is issued. The CalcuList programming environment has been implemented as a
small-sized Java project in Eclipse 4.4.1 with 6 packages and 20 classes all to-
gether. The Jar File for using Calculist may be dowloaded from the link in [8].
The size of this file is rather small: 134 kb. The draft of a tutorial on CalcuList
may be dowloaded from the link in [9].


References
 1. K. Chodorow. MongoDB: The Definitive Guide. O’Reilly Media, Inc., 2013.
 2. W. F. Clocksin and C. S. Mellish. Programming in Prolog (2Nd Ed.). Springer-
    Verlag New York, Inc., New York, NY, USA, 1984.
 3. ECMA Int. The JSON Data Interchange Format. Standard ECMA-404, Ecma
    International, 2013.
 4. P. Hudak, J. Hughes, S. Peyton Jones, and P. Wadler. A History of Haskell:
    Being Lazy with Class. In Proceedings of the Third ACM SIGPLAN Conference
    on History of Programming Languages, HOPL III, pages 12–1–12–55, New York,
    NY, USA, 2007. ACM.
 5. S. Marlow. Haskell 2010 Language Report.
 6. J. Paakki. Attribute Grammar Paradigms – a High-level Methodology in Language
    Implementation. ACM Comput. Surv., 27(2):196–255, June 1995.
 7. V. Reniers, D. Van Landuyt, A. Rafique, and W. Joosen. On the State of NoSQL
    Benchmarks. In Proceedings of the 8th ACM/SPEC on International Conference
    on Performance Engineering Companion, ICPE ’17 Companion, pages 107–112,
    New York, NY, USA, 2017. ACM.
 8. D. Saccà. The CalcuList Environment, Release 4.1.0.
    https://github.com/saccad/CalcuList/blob/master/CalcuList R 4 1 0.jar, 2017.
 9. D. Saccà. The CalcuList Release 4.1 Tutorial, Version 1.
    https://github.com/saccad/CalcuList/blob/master/CalcuList Tutorial Release 4 v3 5.pdf,
    2017.
10. D. Saccà and A. Furfaro. CalcuList: a Functional Language Extended with Imper-
    ative Features. CoRR, abs/1802.06651, 2018.
11. D. Saccà and A. Furfaro. Using calculist to mapreduce jsons. In B. C. Desai,
    S. Flesca, E. Zumpano, E. Masciari, and L. Caroprese, editors, Proceedings of
    the 22nd International Database Engineering & Applications Symposium, IDEAS
    2018, Villa San Giovanni, Italy, June 18-20, 2018, pages 74–83. ACM, 2018.
12. C. Team. CouchDB 2.0 Reference Manual. Samurai Media Limited, United King-
    dom, 2015.
13. G. van Rossum and J. Fred L. Drake. The Python Language Reference Manual
    (version 3.2). Network Theory Limited, 2011.
14. P. Winston and B. Horn. LISP. Second edition. Addison Wesley Pub.,Reading,
    MA, Jan 1986.

</pre>