=Paper= {{Paper |id=Vol-2404/paper20 |storemode=property |title=A Scalable and Distributed Actor-Based Version of the Node2Vec Algorithm |pdfUrl=https://ceur-ws.org/Vol-2404/paper20.pdf |volume=Vol-2404 |authors=Gianfranco Lombardo,Agostino Poggi |dblpUrl=https://dblp.org/rec/conf/woa/LombardoP19 }} ==A Scalable and Distributed Actor-Based Version of the Node2Vec Algorithm== https://ceur-ws.org/Vol-2404/paper20.pdf
                                       Workshop "From Objects to Agents" (WOA 2019)


A Scalable and Distributed Actor-Based Version of the Node2Vec Algorithm

Gianfranco Lombardo and Agostino Poggi
Department of Engineering and Architecture, University of Parma, Parma, Italy
gianfranco.lombardo@unipr.it, agostino.poggi@unipr.it



   Abstract—The analysis of systems that can be modeled as networks of interacting entities is becoming increasingly important in different research fields. The application of machine learning algorithms, such as prediction tasks over nodes and edges, requires either manual feature extraction or a learning task that extracts features automatically (embedding techniques). Several approaches have been proposed in recent years, and the most promising one is represented by the Node2Vec algorithm. However, common limitations of graph embedding techniques are related to memory requirements and time complexity. In this paper, we propose a scalable and distributed version of this algorithm, called ActorNode2Vec, developed on an actor-based architecture that allows these kinds of constraints to be overcome. We demonstrate the efficacy of this approach on a real large network by analyzing the sensitivity of two of the algorithm's parameters (walk length and number of walks) in comparison with the original algorithm. Results show an average reduction of between 65% and 82% in the required computational times.

   Index Terms—Network science, embedding, graph embedding, node2vec, actodes, distributed systems, data mining, complex systems, random walk, actor model

                        I. INTRODUCTION

   Real systems are often characterized by heterogeneous or similar entities that interact with each other. These interactions, with their characteristics and the possible features of the entities, are the key factors for discovering the dynamics of these systems. Systems of this kind are pervasive in various disciplines such as sociology, health science, computer science, physics and finance. For this reason, the study of so-called complex systems is gaining increasing interest in various research fields. The analysis of complex systems often involves the use of networks or graphs to model the behavior of the system. The basic idea is to describe the entities as nodes and their interactions and dynamics as (un)directed edges. For decades the study of graph data was limited to the analysis of the network topology with structural metrics capable of extracting connectivity patterns among the system entities. More recently, with the development of machine learning techniques, the idea has emerged of also taking advantage of these kinds of structures to perform prediction tasks and knowledge discovery. For example, in [1] and [2] the authors analyze a Facebook group of patients to extract new knowledge about their emotional state and disease temporal patterns by modeling them as two attributed networks: an interaction network and a friendship network. In [3] the authors analyze the same community in order to extract the giant component with a bio-inspired algorithm, without using topology information. In [4] the authors use a temporal network to model the USA financial market in order to discover correlations among the dynamics of stock clusters and to predict economic crises. In [5] the authors model the interactions between proteins as a network, with the aim of automatically predicting, for each protein, a correct label describing its functionality. In light of this, this formulation of complex systems enables analyses that take into consideration not only the features of the entities (nodes), but also the relationships established among them. For this reason, the application of machine learning techniques to structured data, such as node classification and link prediction, represents one of the main challenges in Data Mining and Network Science. Traditionally, these tasks have been performed using statistical models with a manual extraction of features based on the network topology (e.g., degree, clustering coefficient, modularity). In recent years, thanks to the adoption of neural networks in other data mining contexts, several works have highlighted the importance of first performing a Representation Learning task to automatically extract features from networks that preserve local and global information from the graph topology. The most promising approaches are represented by embedding techniques, whose aim is to learn node and edge representations in an n-dimensional Euclidean space. To the best of our knowledge, the State of the Art in this field is represented by the Node2Vec algorithm [6]. However, the main limitations of graph embedding techniques are related to memory constraints and to the increase in computational times, which sometimes make the application of these techniques to real, large networks infeasible. In this paper, we propose a scalable version of the Node2Vec algorithm based on a distributed actor-based framework: ActorNode2Vec. This version of the algorithm has been developed on ActoDeS [7], a framework for the development of large concurrent and distributed systems. We demonstrate the efficacy of this approach on a real large network (the Enron network [8]) by analyzing the sensitivity of two of the algorithm's parameters (walk length and number of walks). Results show a significant improvement in terms of reduction of computational times and the overcoming of memory constraints in the case of a real, large network.





                     II. GRAPH EMBEDDING

   Combining machine learning and graph analysis techniques to work with networks has been a challenge for decades, and two main general approaches have been proposed in the literature:
   • Ad-hoc learning models for graphs: building new learning models that can take a network directly as input and that are specialized in prediction tasks on graph data. The most important advances in this direction are represented by the Graph Neural Network model [9], which maps the network topology onto the units of a Recurrent Neural Network, and the Graph Convolutional Network [10], which takes the adjacency matrix and node-specific features as input and learns prediction tasks by using different kinds of convolution operations in the graph domain.
   • Embedding techniques for non-Euclidean data: the idea of finding a latent vector representation able to capture the key features of the data is common to different Data Mining and Machine Learning tasks (e.g., Natural Language Processing, Computer Vision, Dimensionality Reduction). The main reason for this approach lies in the need to use well-known machine learning models (e.g., SVMs, decision trees, regression models and neural networks), which require a Euclidean feature representation, with non-Euclidean heterogeneous data such as text, graph data, collections of images and database records. In this paper we refer to this approach.

   Graph embedding techniques can be divided into two general families: factorization-based methods and random-walk-based methods. Algorithms of the first family obtain the embedding by factorizing a specific matrix of the graph, such as the adjacency matrix, the Laplacian matrix or the node transition probability matrix. Approaches to factorizing the representative matrix vary based on the matrix properties. If the considered matrix is positive semi-definite, e.g., the Laplacian matrix, one can use eigenvalue decomposition. In other cases it is possible to factorize using gradient descent methods. Some examples of this kind of embedding algorithm are LLE [11], which preserves first-order proximity, and LINE [12], which also preserves second-order proximity in the graph. Algorithms of the second family instead obtain a graph embedding by randomly generating walks over the edges and processing them with different learning models to extract and preserve the key features of the graph. The main advantages of random walks are that they are useful for approximating many properties of the graph, including node centrality and similarity, and that they are easy to generate. This approach is gaining interest for its simplicity and thanks to the adoption, as learning model, of the Skip-gram model [13], already largely used in text embedding. The key innovation of this approach is optimizing the node embeddings so that nodes have similar representations if they tend to co-occur on short random walks over the graph. To the best of our knowledge, the algorithm that outperforms the State of the Art in graph embedding is Node2Vec [6], which is a random-walk-based algorithm.

                  III. THE NODE2VEC ALGORITHM

   The Node2Vec algorithm aims to learn a vectorial representation of the nodes (edges) of a graph by optimizing a neighborhood-preserving objective. It extends the previous node embedding algorithm DeepWalk [14] and is inspired by the State of the Art word embedding algorithm Word2Vec [15]. From the latter, Node2Vec retrieves the Skip-gram model [13], the idea of learning embeddings by analyzing the relationships among entities in the form of sequences, and the use of a shallow neural network model. In fact, Word2Vec learns word embeddings by analyzing the context of each word in a corpus: it predicts the current word from a sliding window of surrounding context words. The key idea behind applying this model in the graph domain is to build something similar to a text corpus (a set of word sequences) by performing random walks on the graph. The result can be thought of as a set of sentences in which the words are replaced by the nodes that compose each walk. The innovation of Node2Vec with respect to DeepWalk lies in the method used to build biased random walks on the network. In DeepWalk, in fact, random walking is obtained by uniform random sampling over the linked nodes, while Node2Vec combines two different strategies for network exploration: Depth-First Search (DFS) and Breadth-First Search (BFS), see Figure 1. It also uses a second-order Markov chain and Alias sampling to select the next node to be visited in the walk. The main parameters of the algorithm are presented in Table I, while the main steps of the algorithm are the following:
   1) Probabilities computation over the edges with a second-order Markov chain.
   2) Alias sampling and random walks generation.
   3) Embedding with the neural network model.

                           TABLE I
                 MAIN PARAMETERS OF NODE2VEC

   Parameter            | Description
   Dimension            | Dimension of the embedding space
   Number of walks      | Number of walks to be generated per node
   Walks length         | The desired length of each walk
   P (Return parameter) | Controls the likelihood of immediately revisiting a node in the walk. If high, more DFS-like, otherwise more BFS-like exploration
   Q (In-out parameter) | If Q is higher than 1, more BFS-like, otherwise more DFS-like exploration

A. Probabilities computation

   In order to feed the neural network for the embedding step, Node2Vec has to perform a fixed number of random walks of fixed length starting from each node. These walks are based on a second-order Markov chain that takes into consideration the parameters P and Q to mix the two search strategies introduced previously. To achieve this result, the first duty of the algorithm is to generate all of the required probabilities over the edges for the next step of random walk generation.
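The corpus-building idea described above can be made concrete with a short sketch. For readability, the sketch below uses DeepWalk-style uniform neighbor sampling rather than the biased P/Q sampling that Node2Vec defines; the toy graph and all function names are ours, not part of the original implementation:

```python
import random

def uniform_random_walk(adj, start, walk_length, rng):
    """DeepWalk-style walk: at each step pick a neighbor uniformly at random."""
    walk = [start]
    while len(walk) < walk_length:
        neighbors = adj[walk[-1]]
        if not neighbors:          # dead end: stop the walk early
            break
        walk.append(rng.choice(neighbors))
    return walk

def build_corpus(adj, num_walks, walk_length, seed=0):
    """One 'sentence' of node ids per (node, walk) pair, as fed to Skip-gram."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(num_walks):
        for node in adj:
            corpus.append(uniform_random_walk(adj, node, walk_length, rng))
    return corpus

# toy undirected graph as adjacency lists
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
corpus = build_corpus(adj, num_walks=2, walk_length=5)
print(len(corpus))   # 8 sentences: 2 walks for each of the 4 nodes
```

Feeding such a corpus of node "sentences" to a Skip-gram model is what turns co-occurrence on walks into similar embeddings.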








Fig. 1. BFS and DFS search strategies from node S.

   Formally, given a source node t, which can also be an intermediate step of other walks, a second-order walk is generated by considering the second-order neighborhood of node t. Let c_0 = t and c_1 = v, where v is a first-order neighbor of node t. The next edge to be traversed is chosen according to the following distribution:

      P(c_i = x | c_{i-1} = v) = (ω_vx · α_pq(t, x)) / Z   if (v, x) ∈ E,  and 0 otherwise      (1)

   where ω_vx is the edge weight, Z is the normalizing constant and α_pq(t, x) is the search bias function. The bias function α is defined with the aim of joining the BFS and DFS strategies:

      α_pq(t, x) = 1/P  if d_tx = 0;   1  if d_tx = 1;   1/Q  if d_tx = 2      (2)

   where d_tx is the geodesic distance between the starting node t and the considered node x. An example of this step is presented in Figure 2. At the end of this phase, each edge of the network has a set of values that describe the probabilities of the edge being traversed during a random walk, depending on the source node. This computational step represents the first bottleneck of the algorithm in terms of memory and required computational times, because it depends on the dimension of the network (number of nodes and edges) and on its density. How to overcome the emerging limits of this computation when analyzing a large network is further explained in Section V.

Fig. 2. Example of the calculus of probabilities over edges with a second-order Markov chain.

B. Generation of the random walks

   Once a set of probabilities is associated with each edge, depending on the second-order neighborhood of each node, Node2Vec is ready to effectively perform the random walks based on the Number of walks and Walks length parameters. To sample the edges to be traversed from the probability distribution obtained in the previous step, the algorithm uses the Alias method [16]. This method enables the algorithm to sample efficiently from the discrete probability distribution by building and consulting two tables: the probabilities table and the alias table. To take a decision, the algorithm uses a system based on a biased coin flip. At the end of this phase, the random walks are collected in a structure that plays the role of a sentence corpus in text mining, where each sentence is the ordered sequence of the nodes traversed by a walk.

C. Embedding with Neural networks

   The neural network model used by Node2Vec is exactly the same as that of Word2Vec [15]. The basic idea is to train a simple neural network with a single hidden layer to predict the most probable word following the input one; however, the algorithm then does not use this model for the task it was trained on. Instead, the goal is actually just to learn the weights of the hidden layer, which will represent the Euclidean representation of the input. The training set is composed of word pairs (x, y), where x is the input word and y is a word near x within a fixed window in each sentence of the corpus. In the case of Node2Vec the operation is the same, but the algorithm uses node pairs and the corpus is composed of the previously calculated random walks.

                    IV. ACTODES OVERVIEW

   ActoDeS (Actor Development System) is a Java software framework for the development of concurrent and distributed applications [17], [18] that has been experimented with in different application domains (see, for example, [17], [19], [20], [21]). This software framework allows the definition of applications where the computation is distributed over some concurrent objects (from here on called actors), which are derived from the actor model [22] and implemented on top of preexisting Java software libraries and solutions supporting concurrency and distribution. These actors interact by exchanging asynchronous messages and, after their creation, can change their behaviors several times until they kill themselves. Each behavior has the main duty of processing the incoming messages that match some message patterns. Therefore, if an unexpected message arrives, the actor keeps it in its mailbox until a behavior is able to process it, or a behavior kills the actor. The processing of the incoming messages is performed through some message handlers. In response to a message, an actor uses the associated handler for: sending other messages, creating new actors, changing its local state or its behavior, setting a timeout within which to receive a new message and, finally, killing itself.
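The mailbox-and-behavior mechanics just described can be illustrated with a minimal, Python-flavored sketch (ActoDeS itself is a Java framework; the classes and names below are ours and do not reproduce its API):

```python
from collections import deque

class Actor:
    """Toy actor: a mailbox plus a current behavior (a dict of message handlers).
    As in ActoDeS, a message that no handler matches stays in the mailbox
    until a later behavior can process it."""

    def __init__(self, behavior):
        self.mailbox = deque()
        self.behavior = behavior          # {message_pattern: handler}
        self.alive = True

    def send(self, pattern, payload=None):
        self.mailbox.append((pattern, payload))

    def become(self, behavior):
        """Change behavior; kept messages get a new chance on the next step."""
        self.behavior = behavior

    def step(self):
        """Drain the mailbox once; unmatched messages are kept for later."""
        kept = deque()
        while self.mailbox and self.alive:
            pattern, payload = self.mailbox.popleft()
            handler = self.behavior.get(pattern)
            if handler is None:
                kept.append((pattern, payload))   # keep for a future behavior
            else:
                handler(self, payload)
        self.mailbox.extend(kept)

# a behavior that only understands "start", then one that understands "data"
def on_start(actor, _):
    actor.become({"data": on_data, "kill": on_kill})

def on_data(actor, payload):
    actor.received = payload

def on_kill(actor, _):
    actor.alive = False

a = Actor({"start": on_start})
a.send("data", 42)        # arrives too early: no matching handler yet
a.send("start")
a.step()                  # handles "start" (behavior change); "data" is kept
a.step()                  # the new behavior now processes the kept message
```

After the two steps, the early "data" message has been processed by the second behavior, mirroring the mailbox retention described above.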





In particular, when a timeout fires, the actor automatically sends a timeout message to itself and the corresponding message handler is executed. Depending on the complexity of the application and on the availability of computing and communication resources, an application can be distributed over one or more actor spaces. Each actor space corresponds to a Java virtual machine, so a system distributed over several actor spaces can be deployed on one or more computational nodes. An actor space acts as a container for a set of actors and provides them with the necessary services for their execution. In particular, an actor space performs its tasks through three main run-time components (i.e., controller, dispatcher and registry) and through two run-time actors (i.e., the executor and the service provider). The controller manages the execution of an actor space: it configures and starts the run-time components and the actors of the application, and manages its activities until the end of its execution. The dispatcher supports the communication among actor spaces by using different communication technologies (the current release of the software supports ActiveMQ, MINA, RabbitMQ, RMI, and ZeroMQ). The registry is a run-time component that supports actor creation and message passing. In fact, it creates the references of new actors and supports the delivery of the messages that come from remote actors, by mapping each remote reference onto the local reference of the final destination of the message. The executor actor manages the execution of the actors of an actor space and in some cases creates its initial set of actors. Finally, the service provider actor provides an extensible set of services to the other actors of the application, allowing them to perform new kinds of actions (e.g., to broadcast or multicast a message and to move from one actor space to another). The development of a standalone or distributed application consists in the definition of the behaviors assumed by its actors and of a few configuration parameters that allow the selection of the most appropriate implementation of the actors and of the other run-time components of the application. This solution allows the optimization of various execution attributes of an application (e.g., performance, reliability and scalability). Moreover, the deployment of an application on a distributed architecture is simplified, because an actor can move to another computational node and can transparently communicate with remote actors.

Fig. 3. The actor space structure in ActoDeS.

                    V. ACTORNODE2VEC

   In this section, we present our contribution: the development of a distributed actor-based version of the Node2Vec algorithm, ActorNode2Vec, with the aim of improving performance in terms of required computational times and scalability. To achieve this result, we started by analyzing the three phases of the algorithm in order to detect its main bottlenecks.

A. Computational limits of Node2Vec

   Although Node2Vec is presented as a scalable algorithm, the original formulation and implementation involve the use of a multi-threading solution only for the second phase of random walk generation (the workers parameter) to reduce the computational times. However, when experimenting with Node2Vec on large networks, it immediately becomes clear that the first phase, the probabilities computation, is the most critical point of the architecture. This issue is due to the construction of the required probability distribution presented in equation (1). The use of a second-order Markov chain considering the second-order neighborhood of each node inevitably leads to a heavy dependency on the number of nodes and, in particular, on the network density. Increasing the dimension of the network (number of nodes and edges), in fact, leads to two main critical issues:
   • Memory requirements become computationally unfeasible: memory availability often represents a critical point in large network mining. Requirements become infeasible on a single calculus node when the construction of the probability distribution requires storing a vector of probabilities on each edge, depending on the second-order Markov chain.
   • The explosion of the algorithm's time complexity: the time required to construct the probability distribution increases depending on the dimension of the graph and on the random walk parameters (e.g., number of walks and walk length).
   In light of this, this phase represents the first considerable bottleneck of the algorithm. Our proposal is to overcome this issue using an actor-based architecture that enables a reformulation of this step. The key idea is to distribute the computation of the probability distribution in equation (1) over several specialized actors, each of which takes the burden of this computation for a subset of nodes, without the need for a preliminary ordering of the nodes in the graph.

B. The actor-based architecture

   In order to implement an actor-based solution, we have defined the different actor behaviors and elements that compose the software architecture:
   • Remote Launcher: it represents the algorithm's initialization point; it has the duty of deciding how many and which computational nodes have to be involved in the execution of the algorithm.





   • ActoDeS Broker: it is a software component implemented in ActoDeS that has the duty of initializing the communications among the actor spaces.
   • Node2Vec Initiator: this behavior has the duty of initializing the algorithm. It requires some starting messages from the Remote Launcher to set the Node2Vec parameters and the input network to embed. The actor space that contains the unique actor with this behavior is defined as the primary actor space.
   • Actorspace Initiator (AI): it has the duty of initializing the necessary actors for the algorithm in the actor spaces that do not contain any Node2Vec Initiator (secondary actor spaces). In detail, it has the burden of creating the actors that are responsible for the probabilities computation.
   • Coordinator: this behavior represents the core of the algorithm. Once the initialization operations are terminated, it has the duty of managing all of the actors in the architecture and the entire graph embedding process.
   • Probabilities Manager (PM): this behavior represents the reformulation of the probabilities computation of the original Node2Vec algorithm. Each actor that assumes this behavior has the duty of computing probabilities only for a subset of nodes, considering their second-order neighborhood.

C. Starting and initialization steps

   The execution of the algorithm is started by the Remote Launcher (RL) introduced in the previous section. The RL is a server application that has the duty of reading the input network from memory, of choosing the embedding parameters and of initializing the different ActoDeS actor spaces in the distributed network. It initializes the primary actor space, requiring the creation of the Node2Vec Initiator on it and sending the graph and the parameters to this actor. At the same time, the RL initializes an ActoDeS Broker on a secondary actor space to establish the communication among the different actor spaces that make up the architecture (Figure 4). After this step, the RL's main duties are concluded and it can wait for a notification of the algorithm's termination. Meanwhile, the Node2Vec Initiator actor requests the creation of an Actorspace Initiator on each secondary actor space and sends each one the entire graph to be embedded and the Node2Vec parameters. The previous ActoDeS Broker becomes another Actorspace Initiator for its actor space. These different Initiators generate several Probabilities Manager (PM) actors to prepare the architecture for the next steps. The number of actors that are created is a parameter, but the default value is equal to the number of available logical CPUs on the network node. Each PM actor receives the entire graph and the other parameters. This design choice is motivated by the next steps of the algorithm, and in particular by the need to avoid ordering the nodes of the graph, which would make the computational complexity explode. Moreover, this choice also takes advantage of the shared memory of the actors on a single node, because the operations on the graph are read-only (see Figure 5).

Fig. 4. Starting step and creation of the actor spaces

Fig. 5. Initialization step of ActorNode2Vec

D. Distributed and actor-based probabilities computation

   This step represents the key difference between the original version of the algorithm and our proposal. In light of the limits presented previously, we propose a reformulation of this step where the required probability distribution is computed neither on a single thread nor with a multi-thread solution, but by dividing and distributing the entire computation. The simple multi-thread solution on a single computational node has been excluded because it does not permit overcoming the memory issue presented in Section V-A in the case of a large network. We defined a specialized typology of actor behavior (the Probabilities Manager, or PM) that has the duty of computing probabilities for a subset of the graph nodes. Once the initialization step is terminated, the Node2Vec Initiator changes its behavior into Coordinator. This last actor, as its name suggests, is the real execution manager of the following algorithm steps. Its first action is to receive all of the PM references from the Actorspace Initiators; it then kills the latter, as they are no longer necessary. Secondly, the Coordinator sends all of the actor references to each PM actor. This operation is necessary to enable the PMs to understand which subset of nodes is under their responsibility for the probabilities computation. The Probabilities Manager behavior is explained in detail in the next section. Once the probabilities have been calculated for each second-order node neighborhood by the PMs, they send them to the Coordinator, which effectively builds the probability distribution for the next steps. Figure 6 describes this phase of the algorithm.




                                                                138
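The reference exchange just described lets every PM derive its share of the nodes locally, with no global ordering of the graph. The idea can be sketched as follows; this is an illustrative Python sketch with hypothetical names, not the actual ActoDeS (Java) code:

```python
def my_node_subset(all_nodes, pm_refs, my_ref):
    # Every PM sorts the same reference list, so all PMs agree on
    # each other's position without any extra coordination.
    ordered = sorted(pm_refs)
    k = len(ordered)
    idx = ordered.index(my_ref)
    # Stride over the node list: PM number idx takes nodes
    # idx, idx + k, idx + 2k, ...
    return [node for i, node in enumerate(all_nodes) if i % k == idx]
```

Because every PM applies the same deterministic rule to the same inputs, the resulting subsets are disjoint and cover the whole node set after the single initial broadcast of references.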
Once the probabilities have been calculated for each second-order node neighborhood by the PMs, they send them to the Coordinator, which builds the actual probability distribution for the next steps. Figure 6 describes this phase of the algorithm.

      Fig. 6. Distributed and actor-based probabilities computation

E. Probabilities Managers behavior
   Probabilities Managers have the duty of computing probabilities following the second-order Markov chain model presented in Equations 1 and 2. Each one is responsible for a subset of the graph nodes. PMs know each other's references in order to identify the number of actors involved in the calculus and to divide the node set accordingly. This choice avoids ordering the nodes, an operation whose cost grows quickly with the number of nodes. In this way, each PM has visibility of the entire graph, so it can consider the neighborhood of each node, and the computation is efficient because it exploits the memory shared among the actors on a single working node. The node set is divided by each PM using an ordered list of the references of the other involved PMs, so each actor is able to detect the subset under its responsibility and which actors are responsible for the other nodes. Once each actor has computed its probabilities, using the edge weights for the first-order neighborhood and the Markov chain model for the second order, it also prepares the probability distribution to be sampled with the alias method. The output of each PM is in fact two data structures, one for the probabilities and one for the related aliases, as the method requires [16]. At the end, each PM sends these two structures to the Coordinator, which has the duty of collecting and joining them into a single probability distribution. To summarize, Probabilities Managers compute the probabilities for a subset of nodes and edges, and they also compute the alias probabilities to speed up the subsequent sampling process.

F. Random walks generation and embedding phase
   Once the Coordinator has received all of the required probabilities, it builds the probability distribution to be sampled for the random walks generation. It implements the biased-coin system to sample from the probability and alias data structures, and generates the random walks from each node with a fixed length, depending on the algorithm's parameters. After that, the Coordinator collects all of the random walks in a corpus and feeds the neural network model based on Word2Vec described previously. The embedding phase can represent the second bottleneck of the system in terms of time complexity, depending in particular on which technologies are used to implement the model. Some approaches to resolve this other issue are presented in the future developments section.

                VI. EXPERIMENTATION AND RESULTS
   We evaluated ActorNode2Vec on a well-known large network, Enron, to demonstrate the benefits of the actor-based solution and its efficiency in terms of time complexity. Memory issues are overcome by the scalability of the actor-based solution.

A. The Enron Email dataset
   The Enron Email dataset is a large collection of over 600,000 emails generated by the past employees of the Enron Corporation and acquired by the Federal Energy Regulatory Commission during its investigation after the company's bankruptcy in 2001. This collection has been processed in several scientific studies (e.g., [8], [23]) analyzing discussion threads and the content of the emails. These works identified the most important component, comprising about 37 thousand email addresses. In particular, in [23] this component is modeled as a graph by considering each email address as a node (36,692 nodes) and the email exchanges as undirected edges (183,831 edges). It is currently considered one of the most important standard networks in Network Science.

B. Experiments setup
   In order to compare Node2Vec and ActorNode2Vec on the Enron network, we used a computer cluster with 4 Linux nodes. The most performant one (Node 1) was used to execute Node2Vec and the Coordinator of ActorNode2Vec; the others executed the Remote Launcher, the ActoDeS broker and the secondary actor spaces for the probabilities computation. The hardware specifications are the following:
   • Node 1: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz, with 8 cores per socket, 2 threads per core, 64 GB of RAM
   • Nodes 2, 3 and 4: Genuine Intel(R) CPU T1300 @ 1.66GHz, with 4 cores per socket, 1 thread per core, 16 GB of RAM

C. Results
   In order to compare the two versions of the algorithm, we chose an initial configuration of the parameters:
   • Embedding dimensions: 100
   • P: 1
   • Q: 1
   • Walk length: 10
   • Number of walks: 10
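The two data structures produced by each PM (Section E) and the biased-coin sampling used by the Coordinator (Section F) follow Walker's alias method [16]. A minimal Python sketch of the construction and the O(1) draw (illustrative only; the actual system is implemented on ActoDeS in Java):

```python
import random

def build_alias_table(probs):
    """Build the (prob, alias) pair of arrays required by the alias method
    for a discrete distribution given as a list of probabilities."""
    n = len(probs)
    prob, alias = [0.0] * n, [0] * n
    scaled = [p * n for p in probs]
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        # The large entry donates the mass needed to fill column s up to 1.
        scaled[l] -= (1.0 - scaled[s])
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:      # numerical leftovers: full columns
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias, rng=random):
    """Biased coin: pick a uniform column, keep it with probability
    prob[i], otherwise return its alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

Construction costs O(n) over each PM's subset, and every subsequent draw costs O(1) (one uniform index plus one biased coin), which is what keeps the random-walk generation cheap regardless of the size of the joined distribution.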
                           TABLE II
  REQUIRED TIMES WITH THE INITIAL PARAMETERS CONFIGURATION

                  Probabilities and rand. walks   Embedding   Total
 Node2Vec         244.23 s                        390.87 s    635.10 s
 ActorNode2Vec    58.30 s                         69.62 s     127.92 s

                           TABLE III
       REQUIRED TIMES WITH 300 EMBEDDING DIMENSIONS

                  Probabilities and rand. walks   Embedding   Total
 Node2Vec         481.22 s                        745.16 s    1226.38 s
 ActorNode2Vec    60.28 s                         823.75 s    884.03 s

      Fig. 7. Sensitivity analysis of the Walk length parameter
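As a quick arithmetic check, the relative time reductions discussed for Tables II and III can be recomputed directly from the reported totals (the function name below is ours, introduced only for illustration):

```python
def time_reduction(baseline_s, actor_s):
    """Relative reduction in total runtime: (baseline - actor) / baseline."""
    return (baseline_s - actor_s) / baseline_s

# Totals from Table II (100 embedding dimensions): ~80% reduction
r100 = time_reduction(635.10, 127.92)
# Totals from Table III (300 embedding dimensions): ~28% reduction
r300 = time_reduction(1226.38, 884.03)
```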


With parameters P and Q both equal to one, we can assume that the DFS and BFS strategies are used uniformly in the random walks generation. This particular choice of parameters also enables a comparison with the DeepWalk algorithm, in which the two search strategies are balanced. In any case, P and Q do not affect the algorithm's performance. The results of this first comparison are reported in Table II. In this first experiment ActorNode2Vec required about 80% less time to embed the Enron network, and the improvement appears to derive both from a shorter probabilities computation and from a faster embedding phase. We analyzed this case further by increasing the number of embedding dimensions to 300. In this way the embedding phase takes several minutes in both versions of the algorithm, because this parameter determines the number of neurons in the hidden layer of the adopted multi-layer perceptron. It should also be noted that training and inference of the statistical model could be more efficient in Python, with the Gensim implementation of the neural network, than in Java. The results for this case are reported in Table III: ActorNode2Vec took about 28% less time than Node2Vec, but here the improvement derives only from the reduced time of the actor-based probabilities computation.

D. Sensitivity analysis of the parameters
   In light of the results presented in the previous section, we ran ActorNode2Vec and Node2Vec with the same initial configuration of the parameters, further analyzing the results while varying the number of walks and the walk length in a range between 1 and 50. The results concerning the walk length are presented in Figure 7, and those concerning the number of walks in Figure 8. These two analyses are necessary to understand how the actor-based solution of ActorNode2Vec improves the time complexity of the algorithm, since walk length and number of walks are the main parameters that affect the required time of the probabilities computation phase. In both cases, our solution requires less time to embed the large network of interest. The results were obtained as an average of three executions of the two algorithms: ActorNode2Vec requires on average 65% less time when varying the walk length and 82% less when varying the number of walks.

     Fig. 8. Sensitivity analysis of the Number of walks parameter

           VII. CONCLUSIONS AND FUTURE DEVELOPMENTS
   In this paper we propose a scalable and distributed version of the Node2Vec algorithm, called ActorNode2Vec, developed on an actor-based architecture that overcomes some limitations in terms of memory requirements and time complexity when the algorithm is applied to large networks. The results show an average reduction of between 65% and 82% in the required computation times, while memory issues are overcome by the scalability of the solution. Some future developments concern the second bottleneck of the algorithm, namely the time complexity of the neural network adopted for the embedding phase: for example, an implementation with DeepLearning4j could improve the required training times, as could distributing the training phase as well. Other future developments concern the use of the actors for a distributed generation of the random walks and the use of different strategies for the network exploration. Finally, other research activities will be dedicated to ActoDeS: we are working on improving the support for knowledge- and ontology-based applications [?], [24], [25] and on the development of a tool that aims to simplify the design of applications [26], [27].

                           REFERENCES
 [1] G. Lombardo, P. Fornacciari, M. Mordonini, L. Sani, and M. Tomaiuolo, "A combined approach for the analysis of support groups on Facebook: the case of patients of hidradenitis suppurativa," Multimedia Tools and Applications, vol. 78, no. 3, pp. 3321–3339, 2019.
 [2] G. Lombardo, A. Ferrari, P. Fornacciari, M. Mordonini, L. Sani, and M. Tomaiuolo, "Dynamics of emotions and relations in a Facebook group of patients with hidradenitis suppurativa," in International Conference on Smart Objects and Technologies for Social Good, pp. 269–278, Springer, 2017.
 [3] L. Sani, G. Lombardo, R. Pecori, P. Fornacciari, M. Mordonini, and S. Cagnoni, "Social relevance index for studying communities in a Facebook group of patients," in International Conference on the Applications of Evolutionary Computation, pp. 125–140, Springer, 2018.
 [4] A. Kocheturov, M. Batsyn, and P. M. Pardalos, "Dynamics of cluster structures in a financial market network," Physica A: Statistical Mechanics and its Applications, vol. 413, pp. 523–533, 2014.
 [5] P. Radivojac, W. T. Clark, T. R. Oron, A. M. Schnoes, T. Wittkop, A. Sokolov, K. Graim, C. Funk, K. Verspoor, A. Ben-Hur, et al., "A large-scale evaluation of computational protein function prediction," Nature Methods, vol. 10, no. 3, p. 221, 2013.
 [6] A. Grover and J. Leskovec, "node2vec: Scalable feature learning for networks," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864, ACM, 2016.
 [7] F. Bergenti, A. Poggi, and M. Tomaiuolo, "An actor based software framework for scalable applications," in International Conference on Internet and Distributed Computing Systems, pp. 26–35, Springer, 2014.
 [8] B. Klimt and Y. Yang, "Introducing the Enron corpus," in CEAS, 2004.
 [9] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, "The graph neural network model," IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2009.
[10] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907, 2016.
[11] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[12] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, "LINE: Large-scale information network embedding," in Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077, International World Wide Web Conferences Steering Committee, 2015.
[13] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.
[14] B. Perozzi, R. Al-Rfou, and S. Skiena, "DeepWalk: Online learning of social representations," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710, ACM, 2014.
[15] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, pp. 3111–3119, 2013.
[16] A. J. Walker, "An efficient method for generating discrete random variables with general distributions," ACM Transactions on Mathematical Software (TOMS), vol. 3, no. 3, pp. 253–256, 1977.
[17] E. Franchi, A. Poggi, and M. Tomaiuolo, "Blogracy: A peer-to-peer social network," in Censorship, Surveillance, and Privacy: Concepts, Methodologies, Tools, and Applications, pp. 675–696, IGI Global, 2019.
[18] F. Bergenti, E. Iotti, A. Poggi, and M. Tomaiuolo, "Concurrent and distributed applications with ActoDeS," in MATEC Web of Conferences, vol. 76, p. 04043, EDP Sciences, 2016.
[19] G. Angiani, P. Fornacciari, G. Lombardo, A. Poggi, and M. Tomaiuolo, "Actors based agent modelling and simulation," in Highlights of Practical Applications of Agents, Multi-Agent Systems, and Complexity: The PAAMS Collection, (Cham), pp. 443–455, Springer International Publishing, 2018.
[20] P. Fornacciari, M. Mordonini, A. Poggi, L. Sani, and M. Tomaiuolo, "A holistic system for troll detection on Twitter," Computers in Human Behavior, vol. 89, pp. 258–268, 2018.
[21] G. Lombardo, P. Fornacciari, M. Mordonini, M. Tomaiuolo, and A. Poggi, "A multi-agent architecture for data analysis," Future Internet, vol. 11, no. 2, p. 49, 2019.
[22] G. Agha, Actors: A Model of Concurrent Computation in Distributed Systems. Cambridge, MA, USA: MIT Press, 1986.
[23] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney, "Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters," Internet Mathematics, vol. 6, no. 1, pp. 29–123, 2009.
[24] F. Bergenti, A. Poggi, M. Tomaiuolo, and P. Turci, "An ontology support for semantic aware agents," in Proc. Seventh International Bi-Conference Workshop on Agent-Oriented Information Systems (AOIS-2005@AAMAS), Utrecht, The Netherlands, 2005.
[25] A. Poggi, "Developing ontology based applications with O3L," WSEAS Transactions on Computers, vol. 8, no. 8, pp. 1286–1295, 2009.
[26] F. Bergenti and A. Poggi, "Exploiting UML in the design of multi-agent systems," in International Workshop on Engineering Societies in the Agents World, pp. 106–113, Springer, 2000.
[27] F. Bergenti and A. Poggi, "A development toolkit to realize autonomous and interoperable agents," in Proceedings of the Fifth International Conference on Autonomous Agents, pp. 632–639, ACM, 2001.