=Paper=
{{Paper
|id=Vol-2333/paper4
|storemode=property
|title=Patient Engagement: Theoretical and Heuristic Approaches for Supporting the Clinical Practice
|pdfUrl=https://ceur-ws.org/Vol-2333/paper4.pdf
|volume=Vol-2333
|authors=Italo Zoppis,Riccardo Dondi,Sara Manzoni,Giancarlo Mauri
|dblpUrl=https://dblp.org/rec/conf/aiia/ZoppisDMM18
}}
==Patient Engagement: Theoretical and Heuristic Approaches for Supporting the Clinical Practice==
Italo Zoppis¹, Riccardo Dondi², Sara Manzoni¹, and Giancarlo Mauri¹

¹ Department of Computer Science, University of Milano-Bicocca, Milano, Italy
² Department of Letters, Philosophy, Communication, University of Bergamo, Bergamo, Italy

Abstract. Social interaction supports disease management by creating online spaces where patients can interact with clinicians and share experiences with other patients. Promoting targeted communication in online social spaces is therefore a means to group patients around shared goals, offer emotional support, and, ultimately, engage patients in their healthcare decision-making process. In this paper, we approach the topic from a theoretical perspective: we design an optimization problem aimed at encouraging the creation of (induced) sub-networks of patients who, being recently diagnosed, wish to deepen their knowledge about their medical treatment with other similarly profiled patients who have already been followed up by specific (even alternative) care centers. Due to the computational hardness of the proposed problem, we provide approximate solutions based on distributed heuristics (i.e., Genetic Algorithms). Results are given for simulated data using Erdős–Rényi random graphs.

Keywords: Social Network · Distributed Heuristic · Genetic Algorithm · Dense Communities · 2-club optimization problem

1 Introduction

Social media can directly support disease management by creating online spaces where patients can interact with clinicians and share experiences with other patients [11,27]. For example, cancer patients use Twitter to discuss treatments and provide psychological support [32], and online engagement seems to correlate with lower levels of self-reported stress and depression [7].
Similarly, wellness programs frequently incorporate social media to create a sense of community [34], group people around shared goals, and offer social and emotional support. A trial reported that adding online community features to an Internet-mediated wellness and walking program improved adherence and reduced participant attrition [24]. Nevertheless, much work remains to be done on sharing targeted and optimized information content. How can we optimize a procedure that facilitates the encounter between patients who want to deepen or share experiences about treatments, care points, and specialists? How can we correlate, for example, similar clinical profiles, while inducing networks of medical staff and treated patients who offer their availability to share experiences or suggestions? These are exactly the issues we deal with in this paper.

In particular, we focus on the problem of creating a space of individuals and care centers, by considering the case where recently diagnosed patients may be interested in meeting other patients (their experience), to share information on their disease and on the suggested (or available) care center. In this situation, it would be useful, for example, to encourage the diagnosed subjects to socialize and to compare notes with the experience of other patients with a similar clinical profile who have already been followed up within the same (or even an alternative) proposed care point.

It is clear that a proper handling of procedures and data is fundamental in order to convert the available information into a useful formulation that leads to particular (induced) communities. From a theoretical perspective, this consideration can be expressed as the problem of finding dense or cohesive subgraphs (with particular properties) inside a network. Unfortunately, as we will report in the next sections, the intrinsic complexity of the considered computational problem makes optimization potentially impracticable.
For this reason, we design a distributed heuristic (i.e., Genetic Algorithms, GAs) to seek faster approximate solutions (see, e.g., [20] for details). In particular, we propose a general framework which, similarly to the Map-Reduce programming paradigm, abstracts away the complexity of learning by allowing programmers to describe their processes in terms of two main procedures, referred to here as Training and Combine. With this approach we are able to provide both distributed concept learning and a proper delivery of the trained models on different nodes of the learning system.

In Section 2 we introduce the main theoretical aspects, focusing on the computational hardness of the considered problem (Section 2.2). Then, we discuss the GA-based approach used to seek approximate results for our formulation (Section 3). In Section 4, we extend this approach to a general distributed environment. Finally, after reporting numerical experiments on simulated data (Section 5), we conclude the paper (Section 6) by discussing our results and describing future directions of this research.

2 Main Definitions

Suppose we wish to model the situation where recently diagnosed patients (RD patients, for short) are encouraged to deepen the knowledge of their diseases through the experience of some other patients (say, engaged patients or, for short, ED), or need more information concerning the suggested (or even alternative) health-care centers (HC).

Fig. 1. An example of a 2-club consisting of 5 vertices, such that each subgraph of four vertices is not a 2-club. Assume that we remove vertex v5; then the graph induced by {v1, v2, v3, v4} is not a 2-club (dG′(v1, v4) = 3).

In this case, RD patients could benefit from the social interaction with similarly profiled (ED) patients, which have already been
To this aim, it should be useful for a social platform to encourage, and optimize, the creation of a sub-network from the available social data. From a theoretical point of view, a network is most com- monly modeled using a graph which represents relationships between objects, V (vertices), through a set of edges, E. In this way, our goal can be formulated by maximizing, within a defined graph, a cohesive sub-graph (i.e., by seeking the largest cohesive sub-graph) with particular properties (as we will detail in the following). Finding cohesive subgraphs inside a network is a well-known pro- blem that has been applied in several contexts. For example, in computational biology, dense sub-graphs are sought in protein interaction networks, as they are considered related to protein complexes [4,31]. In gene networks, dense sub- graphs are generally applied to detect relevant co-expression clusters [30]. While a classical approach to compute dense sub-graphs is the identification of cliques (i.e., complete sub-graphs induced by a set of vertices which are all pairwise connected by an edge), this definition is often too stringent for particular ap- plications. This is the case, when the knowledge on how an individual (vertex) is embedded in the sub-network (e.g.,, some vertices could act as “bridges” be- tween groups, as in our case) is a critical issue to take into account. Therefore alternative definitions of cohesive sub-graphs can be introduced, for example by relaxing some constraints, leading to the concept of relaxed clique [17]. Here, we follow this approach by relaxing the definition of distance between vertices. In a clique distinct vertices are at distance of 1, in our case, vertices can be at distance of at most s = 2. A sub-graph where all the vertices are at distance of at most 2 is called a 2-club (or, more in general, s-club for different values of s). 
2-clubs have been extensively applied to social network analysis [21,1,19,22] and biological network analysis, e.g., protein-protein interaction networks [23]. Note that, when s = 1, a 1-club is exactly a clique. An example of a 2-club with 5 vertices is represented in Figure 1.

2.1 Problem formulation

Consider a graph G = (V, E), and a subset V′ ⊆ V. We denote by G[V′] the subgraph of G induced by V′. Formally, G[V′] = (V′, E′), where E′ = {{u, v} : u, v ∈ V′ ∧ {u, v} ∈ E}. Given a set V′ ⊆ V, we say that V′ induces the graph G[V′]³. The distance dG(u, v) between two vertices u, v of G is the length of a shortest path in G which has u and v as endpoints. The diameter of a graph G = (V, E) is max_{u,v ∈ V} dG(u, v), i.e., the maximum distance between any two vertices of V. In other words, a 2-club in a graph G = (V, E) is a sub-graph G[W], with W ⊆ V, that has diameter at most 2. Moreover, given a vertex v ∈ V, we define the set N(v) as follows:

N(v) = {u : {v, u} ∈ E}

N(v) is called the neighborhood of v. Finally, we denote by N[v] the closed neighborhood of v, where N[v] = N(v) ∪ {v}.

We will formulate the problem using 2-clubs whose shortest paths connecting RD patients with specialist centers/staff (e.g., care center, hospital, or clinical staff) have to “transit” through at least one ED patient who has already been followed up by the considered specialist “point”. Formally, for any pair (d, h), composed of a recently diagnosed patient, d, and, e.g., a health-care center, h, the social platform should suggest that patient d compare with (or even meet) an identified (available) patient x’s experience. More specifically, we seek (within the input “social graph”) a 2-club, G[D ∪ X ∪ H], where D, X and H represent the sets of recently diagnosed patients, experienced patients, and care centers, respectively.
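The graph-theoretic definitions above (distance, induced subgraph, diameter at most 2) can be made concrete with a short Python sketch. This is a minimal illustration, not the authors' implementation: the adjacency-dictionary encoding and all function names are our own. The example graph is the 5-cycle of Figure 1.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path lengths from `source` via breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def induced_subgraph(adj, nodes):
    """Adjacency of G[V'] for a vertex subset V'."""
    nodes = set(nodes)
    return {u: {v for v in adj[u] if v in nodes} for u in nodes}

def is_2club(adj, nodes):
    """True iff the subgraph induced by `nodes` is connected with diameter <= 2."""
    sub = induced_subgraph(adj, nodes)
    for u in sub:
        dist = bfs_distances(sub, u)
        if len(dist) < len(sub) or max(dist.values()) > 2:
            return False
    return True

# The 5-vertex example of Figure 1 (a 5-cycle): it is a 2-club, but
# removing v5 leaves {v1, v2, v3, v4} with d(v1, v4) = 3, hence not a 2-club.
G = {"v1": {"v2", "v5"}, "v2": {"v1", "v3"}, "v3": {"v2", "v4"},
     "v4": {"v3", "v5"}, "v5": {"v4", "v1"}}
# is_2club(G, set(G)) -> True
# is_2club(G, {"v1", "v2", "v3", "v4"}) -> False
```

Checking every vertex's BFS distances, as above, is also the diameter computation whose O(|V|³) bound is discussed later for the fitness evaluation.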
Please note that, when such a structure (i.e., a maximum-size 2-club) exists within the identified starting social space, then for any pair of vertices there must exist at least one simple path of length at most 2, i.e., a path composed of at most a triple of vertices. This, in turn, will also be true for any pair (d, h), where d ∈ D, h ∈ H. Indeed, our goal will be to find a largest-size 2-club with the further property of providing, for any pair (d, h), a shortest path characterized by a triple of vertices (d, x, h) ∈ (D × X × H). In this case, the set of edges modeling the starting social network will be defined as follows.

– Edges between similarly profiled patients.
– Edges expressing that an experienced patient x has already been followed up by the care center h. In this case the edges in X × H will be constructed by knowing both the clinical history of each (experienced) patient, x, and the clinical staff or hospital, h, which has already properly followed up the patient x.
– Edges between vertices h1, h2 ∈ H, for example because two care centers are similar (have similar services or are part of the same institution).

In this situation, the simple path given by the triple of vertices (d, x, h) ∈ (D × X × H) in the 2-club G would suggest that patient d ∈ D contact the patient x ∈ X about the health-care center (or specialist staff) h ∈ H. For the sake of clarity, before defining the problem computationally, we refer to any pair of vertices (d, h) ∈ D × H (such that a minimum path connecting d to h is given by the vertex sequence (d, x, h), for some x ∈ X) as a “feasible pair”.

³ Notice that all the graphs we consider are undirected.

Considering the above discussion, we can define the following variant of the 2-club maximization problem.

Problem 1. Maximum 2-Club (Max-2-club)
Input: a graph G = (D ∪ X ∪ H, R ∪ F).
Output: a set V′ ⊆ D ∪ X ∪ H such that G[V′] is a 2-club having maximum size and, for each pair of vertices (d, h) ∈ D × H in G[V′], a minimum path connecting d to h is given by the vertex sequence (d, x, h) for some x ∈ X (i.e., (d, h) is feasible).

2.2 Computational hardness

The complexity of the Maximum s-club problem has been extensively studied in the literature, and unfortunately it turns out to be NP-hard for each s ≥ 1 [9]; Maximum s-Club is NP-hard even if the input graph has diameter s + 1, for each s ≥ 1 [5]. The same property holds for our variant of Maximum 2-club. Indeed, the computation of a 2-club of maximum size containing a specific vertex v is also NP-hard. By defining D = {v}, X = N(v) and H as the remaining set of vertices, it follows that the “feasibility” property holds.

Given an input graph G = (V, E), Maximum s-club is not approximable within factor |V|^(1/2−ε), for any ε > 0 and s ≥ 2 [3]. On the positive side, polynomial-time approximation algorithms [3] have been given, with factor |V|^(1/2) for every even s ≥ 2, and factor |V|^(2/3) for every odd s ≥ 3. The parameterized complexity of Maximum s-Club has also been studied, leading to fixed-parameter algorithms [28,18,10]. Maximum 2-Club has also been considered for specific graph classes [16,15].

3 A Genetic Algorithm

The computational hardness (and inapproximability) of the problem introduced so far makes optimization potentially impracticable. For this reason, we designed a Genetic Algorithm (GA) to seek faster approximate solutions (see, e.g., [20] for details). In particular, given an input graph G = (V, E), the GA represents a solution (a subset V′ ⊆ V such that G[V′] is a 2-club of G) as a binary chromosome c, of size n = |V|, such that for all vi ∈ V, c[i] is either 1 or 0. Note that, with a slight abuse of notation, we will denote by G[c] the subgraph of G induced by the representation of chromosome c.
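Decoding a binary chromosome into its induced subgraph can be sketched as follows. This is an illustrative sketch only (the vertex ordering, names, and helper functions are our assumptions, not the paper's code); the graph is again the 5-cycle of Figure 1.

```python
def decode(vertices, chromosome):
    """Vertex subset V' selected by a binary chromosome over a fixed vertex ordering."""
    return {v for v, bit in zip(vertices, chromosome) if bit == 1}

def induced_edges(adj, selected):
    """Edge set of G[V'], with each edge encoded as a frozenset {u, v}."""
    return {frozenset((u, v)) for u in selected for v in adj[u] if v in selected}

vertices = ["v1", "v2", "v3", "v4", "v5"]
adj = {"v1": {"v2", "v5"}, "v2": {"v1", "v3"}, "v3": {"v2", "v4"},
       "v4": {"v3", "v5"}, "v5": {"v4", "v1"}}

c = [1, 1, 1, 0, 0]            # hypothesis: V' = {v1, v2, v3}
V_c = decode(vertices, c)      # vertex set of G[c]
E_c = induced_edges(adj, V_c)  # edge set of G[c]: {v1,v2} and {v2,v3}
```

Note that a chromosome is only a hypothesis: nothing in the encoding itself guarantees that G[c] is connected or has diameter at most 2.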
Similarly, V[c] and E[c] will denote the sets of vertices (V′) and edges of G[c] = G[V′].

During offspring generation, chromosomes are interpreted as hypotheses of feasible solutions (i.e., they may represent unfeasible solutions, for example an s-club with s > 2, or a disconnected graph, during the first phases of evolution: they can later evolve into feasible solutions), undergoing mutation, crossover and selection. Chromosome evaluation (i.e., of the hypothesis of a potential 2-club) is then provided, as usual, through the fitness function.

3.1 Fitness definition

In this paper, fitness is designed to promote adaptation in such a way that new candidate chromosomes able to represent graphs with a feasible (correct) diameter value (i.e., not greater than 2) will “evolve” through the elitism mechanism. In this way, a small proportion of the fittest candidates (chromosomes with high fitness values) are selected and preserved unchanged in the next generation. Specifically, given a chromosome c and an input graph G = (V, E), the fitness realizes such a mechanism using the following quantities.

1. An estimate of the number of pairs of vertices (vi, vj) ∈ V[c] × V[c] for which at least one shortest path (between vi and vj) of length at most 2 exists (in the considered chromosome representation G[c]). For a consistent discussion, we will refer to such a collection of pairs (vi, vj) as the Feasible Set (FS), C ⊆ V[c] × V[c].
2. The number of vertices, nv, of the (sub)graph, G[c], induced by c.
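These two quantities might be estimated along the following lines. This is an illustrative Python sketch under our own assumptions about the pair-sampling scheme (all names are ours, not the paper's code); `unfeasible_proportion` corresponds to the proportion np/nS of sampled pairs at distance greater than 2 used by the fitness.

```python
import random
from collections import deque

def shortest_path_len(adj, u, v):
    """BFS shortest-path length between u and v (inf if disconnected)."""
    dist, queue = {u: 0}, deque([u])
    while queue:
        w = queue.popleft()
        if w == v:
            return dist[w]
        for x in adj[w]:
            if x not in dist:
                dist[x] = dist[w] + 1
                queue.append(x)
    return float("inf")

def unfeasible_proportion(adj, vertices, n_s, rng=None):
    """Plug-in estimate np/nS: the fraction of nS sampled vertex pairs
    of the induced subgraph that lie at distance > 2 (i.e., not in C)."""
    rng = rng or random.Random()
    pairs = [(u, v) for i, u in enumerate(vertices) for v in vertices[i + 1:]]
    sample = [rng.choice(pairs) for _ in range(n_s)]
    n_p = sum(1 for u, v in sample if shortest_path_len(adj, u, v) > 2)
    return n_p / n_s
```

The second quantity, nv, is simply the number of selected vertices, i.e., the number of 1-bits in the chromosome.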
In particular, for any chromosome c, we observe a sample S (of size nS) of pairs from V[c] × V[c] and, considering the induced subgraph representation G[c], we evaluate the following fitness:

f(nv; diam) = (1 − np/nS) · nv        if 0 ≤ diam ≤ 2,
f(nv; diam) = (1 − np/nS) · (1/nv)    if 2 < diam,     (1)

where:

– nv is the cardinality of the vertex set of G[c], the subgraph speculated by c;
– np/nS is the (simple plug-in) sample proportion

  np/nS = (1/nS) · Σ_{k=1..nS} I((vi, vj)k ∉ C)

of the (“unfeasible”) pairs (vi, vj) whose shortest paths have length greater than 2; and

– I((vi, vj)k ∉ C) = 1 if (vi, vj)k ∉ C, and 0 otherwise.     (2)

In this way, by using the frequency (1 − np/nS) of the sampled feasible pairs, the fitness weights (when 0 ≤ diam ≤ 2) the number of vertices nv in G[c], thus promoting a large (sub)graph size. On the other hand, given diam > 2 (i.e., unfeasible solutions) and the same frequency (1 − np/nS), we obtain fitness values which decrease asymptotically as 1/nv for large (sub)graph size nv, thus penalizing the corresponding chromosome.

3.2 Search operators

To provide adaptation (with regard to the Maximum 2-club problem), we redefined the following standard operators.

– Mutation (applied with probability 0.75). Mutation must be carefully considered. Indeed, vertex mutations can affect not only the vertex involved in the current operation, but can even modify other parts of the solution. For this reason we defined two further mutation operators, both with the objective of correcting hypotheses (i.e., chromosomes) consistently and parsimoniously. More specifically, three types of mutation are considered.
  • Base Mutation (applied with probability 0.5, given that mutation has been chosen). Similarly to the standard case, each individual from the current population at time i is modified with a given probability (see details in the experimental section).
In this case, mutation flips a bit of the selected chromosome c, in such a way that the corresponding vertex is either removed from (i.e., bit flipped to 0), or added to (i.e., bit flipped to 1), the solution induced by c. Note that eliminating or adding vertices can induce subgraphs that are actually not feasible, i.e., they are not 2-clubs, since the property of being a 2-club is not hereditary. Moreover, modifications of chromosomes introduce the possibility of escaping local minima.
  • Non-Standard Mutation 1 (applied with probability 0.25, given that mutation has been chosen). This modification has the objective of correcting hypotheses (i.e., chromosomes) consistently and parsimoniously. Since any chromosome, by representation, induces a sub-graph G[V′] of G, which in turn may represent a feasible solution, such hypotheses are partially verified using the following principle. Given a selected chromosome c, a vertex v′ is (randomly) sampled from the set V+ = {vi : c[i] = 1}, and the minimum length of the simple paths connecting every pair (vi, v′), vi ∈ V+, is checked for consistency with the chromosome representation; i.e., since each chromosome “speculates” a feasible 2-club, for such a hypothesis to be true there must be at least one simple path of length at most 2 connecting any vi ∈ V+ with v′. If this verification fails, the sampled vertex v′ is flipped to 0.
  • Non-Standard Mutation 2 (applied with probability 0.25, given that mutation has been chosen). This modification has the objective of incrementing (parsimoniously) the size of a solution. In this case, given a selected chromosome c, a vertex v′ is sampled from V− = {vj : c[j] = 0}, and the minimum length of the simple paths connecting every pair (vi, v′), vi ∈ V+, is checked for consistency with the current representation of the chromosome c.
In this case, we extend the hypothesis represented by c by adding v′ to V+ if the minimum distances from v′ to the vertices of V+ are not larger than 2.

– Cross-over (applied with probability 0.25). The following operations are provided.
  • Standard cross-over (applied with probability 0.5, given that crossover has been chosen). Offspring is generated by copying and mixing parts of the parents’ chromosomes. In this case, a standard one-point crossover is implemented.
  • Logical AND between parents (applied with probability 0.25, given that crossover has been chosen). This operation has the objective of providing offspring consistent with the selected parents. For this, pairs of chromosomes are generated through logical AND operations between the ascendants.
  • Logical OR between parents (applied with probability 0.25, given that crossover has been chosen). This operation has the objective of providing offspring extending the parent hypotheses. The extension is given by realizing a logical OR operation between two selected parents.
– Elitist selection (or elitism). In order to guarantee that solution quality does not decrease from one generation to another [6], the best (25%) hypotheses (high fitness values) are allowed to be part of the new offspring.

3.3 Computational Cost of the Search Operators

Mutation and cross-over do not affect the computational cost of the GA evolution. As discussed above, given a selected chromosome c, a vertex v′ is (randomly) sampled from V+ = {vi : c[i] = 1}, and the minimum length of the simple paths connecting every pair (vi, v′), vi ∈ V+, is checked for consistency with the chromosome representation. For this, we intersect, for each vi ∈ V+, the closed neighborhoods N[vi] and N[v′].
If an empty set is obtained (i.e., no vertex adjacent to both vi and v′ exists), then the bit corresponding to v′ is flipped to 0, thus not increasing the complexity of the procedure. A similar argument applies when switching a bit from 0 to 1 (i.e., Mutation 2, as defined above). Conceptually, crossover is “easier” than mutation: it permits the generation of new chromosomes by implementing simple logical AND (respectively, OR) operations (as defined previously), which still do not affect the computational cost of the process. Finally, by observing the fitness formulation, we note that its complexity is mainly due to the diameter evaluation, which is known to be bounded by O(|V|³), thus preserving the tractability of the whole procedure.

4 Distributed learning

As reported above, the huge amount of data obtainable from social interaction activities and the intrinsic complexity of the proposed computational problem make the optimization goal practically intractable. An effective distribution of the computational load over different processors is therefore desirable.

Fig. 2. The distributed GA environment: data are “mapped” over different sites and the best fitted chromosomes are combined into the lower-level population.

In this section, we focus on how to obtain an efficient load distribution, by providing both parallel concept learning and a proper delivery of the trained (i.e., evolved) models on different nodes of a learning system. To this end, we design a general framework which applies the previous GA approach and, similarly to the Map-Reduce programming paradigm, abstracts away the computational complexity of “learning” by allowing programmers to describe their processes in terms of two main procedures, which we will call Training and Combine.

4.1 A genetic cascade model

Learning practically follows the process represented in Figure 2.
Different nodes (referred to as “sites”) locally realize the training phases by implementing the GAs described in Section 3. Beginning at the highest level and progressing towards the lowest level, information on the models trained by the higher sites is combined (equivalently, following a Map-Reduce “logic”, models are “reduced”) into sites of the lower levels. The following functions are provided by the framework.

1. Init. Initializes both the training parameters and the number of first-level sites. Init realizes the “data mapping” over the first-level sites, i.e., it distributes the input data for the first training phase.
2. Training. Sites read the local data as mapped by the Init function and provide the first-level models (i.e., best fitted chromosomes): GAs are executed by using the process described in the Training function. Similarly to ensemble learning systems, sites are trained independently by applying the approach described previously.
3. Combine. Sites are used as filters. Following a boosting approach, the i-th level in Figure 2 attempts to create more accurate and general levels (i.e., i + 1) of solutions by inserting into site si+1,k’s population (i.e., suggesting to site k of level i + 1) the best chromosomes (solutions) obtained by both “parents” si,k and si,k+1. In other words, a lower-level population is initialized with the parents’ best chromosomes.

Large-scale optimization is divided into independent, smaller local optimization problems, where the best fitted chromosomes are distributed among different sites of the system. The best solutions are finally “reduced” (i.e., Combined) into the lowest-level chromosome population, in a hierarchical fashion. Moreover, without any loss of generality, by assuming that patients and structures (i.e., input data and care centers) are initially ordered, each site provides local solutions (if any) on the corresponding input subnetwork. This process is detailed in Figure 3.
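The Training and Combine steps above can be sketched as a small cascade skeleton. This is a schematic sketch only: `train` stands for one local GA run (Section 3) and is passed in as a placeholder, all names are ours rather than the authors' code, and we assume disjoint parent pairs and an even number of sites per level (e.g., 4 starting sites, as in the experiments).

```python
def cascade(partitions, train, elite=2):
    """Hierarchical GA cascade: sites train locally, then each pair of
    parent sites seeds a lower-level site with its best chromosomes."""
    # Training: one independent local GA run per first-level site.
    populations = [train(data, seed=[]) for data in partitions]
    # Combine: pairs of parents seed the next level until one site remains.
    while len(populations) > 1:
        populations = [train(None, seed=a[:elite] + b[:elite])
                       for a, b in zip(populations[::2], populations[1::2])]
    return populations[0]
```

With 4 starting sites this performs two Combine levels, so the lowest-level site starts its evolution from a population already seeded with the best chromosomes found upstream.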
In this case, since the best fitted chromosomes from site A1 can clearly represent (speculate) feasible solutions from its own input data only, one can obtain from site A1 a partial community (solution) referring to the associated input subnetwork A1, or equivalently corresponding to the respective adjacency (sub)matrix. Similarly, from site A2’s input data, we can obtain a feasible local community for the corresponding subnetwork A2. The Combine operation extends the analysis to obtain feasible chromosomes for the subnetwork B1: the local solutions represented by the best chromosomes of site A1, and those from site A2, are combined and inserted into the new population of site B1. The new offspring and the obtained best chromosome representations will provide optimized feasible solutions (if any) for the corresponding extended input data, which in turn represent a larger local solution suggested by the (“parent”) sites A1 and A2. Finally, the lowest-level site completes the evolution (for all the input data), starting from an optimized (suggested) population. Its solutions correspond to 2-club communities for the global input network.

5 Numerical Experiments

The genetic algorithm described in Sec. 3 was coded in R using the Genetic Algorithm (“igraph”) package [29]. As for the cascade model, we used the standard “multiprocessing” Python package, working on local concurrency with 4 cores (i.e., the cascade model uses 4 starting sites). The R and Python interfaces were managed by the rpy2 (https://rpy2.readthedocs.io/en/version_2.8.x) utility. Experiments were executed on Apple OSX 10.12.6; system type MacBook Pro Retina; processor: Intel(R) Core(TM) i5-6360U CPU @ 2.00 GHz, 3.10 GHz, 2 core(s), 4 logical processors; installed physical memory (RAM) 8.00 GB.

Results are given for synthetic data, obtained by sampling Erdős–Rényi random graphs ER(n, p) with different values of the number of vertices, n, and of the probability, p, of creating an edge between a pair of vertices [8].
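For reference, sampling an ER(n, p) instance needs only a few lines. The sketch below uses the Python standard library and is an illustration only (the actual experiments used the R tooling described above):

```python
import random

def erdos_renyi(n, p, rng=None):
    """Sample an Erdős–Rényi graph G(n, p): each of the n(n-1)/2
    possible edges is included independently with probability p."""
    rng = rng or random.Random()
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj

# e.g., one of the model sizes used in the experiments, with a fixed seed:
G = erdos_renyi(150, 0.1, rng=random.Random(0))
```

Repeating the sampling with different seeds, as done five times per model in the experiments, gives the standard deviations reported in the tables below.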
Numerical experiments have the main objective of evaluating both the effectiveness of the distributed learning and the capability of the GAs to obtain correct solutions in a reasonable time.

Fig. 3. Distributed GA environment. Data are “mapped” over different sites and the best fitted chromosomes are combined in the lower levels.

To provide correctness at a “reasonable” cost, we followed the standard practice of evolutionary algorithms: keep the tractability of the search operators and of the fitness, and promote, at the same time, evaluable chromosomes, which in our case provide feasible input diameter values. Moreover, a widely applied termination principle has been applied: we set up both the number of re-evaluations of the fitness over new populations (equivalently, the number of GA iterations) and the number of consecutive generations without improvement of the best fitness value⁴. Note that, to check the robustness of the solutions, each ER model has been sampled iteratively 5 times. The corresponding standard deviation, obtained in each experiment, is reported together with the obtained performances in the following section.

⁴ See, e.g., [26], for a critical analysis of various aspects associated with the specification of termination conditions in a simple genetic algorithm.

5.1 Results

The results are reported in Tables 1 and 2. In particular, the following attributes are considered.

– Input and output diameters. The best GA solution is re-coded, and the resulting graph diameter is reported.
– Number of output vertices. The number of input vertices is given by the first parameter, n, of the random model ER(n, p).
– Fitness value. Fitness is described in Sec. 3.
– Ratio between the number of input graph vertices and the number of vertices of the resulting 2-club.

Table 1. Models (Erdős–Rényi). Input Diameter (InD), Output Diameter (OutD), Output Nodes (OutN), Output Feasible Pairs (OutP).
Centralized Execution
Models         InD           OutD        OutN            OutP
ER(150,0.1)    3.3 (0.48)    2 (0)       18.5 (3.53)     25.5 (21.2)
ER(150,0.2)    3 (0)         2 (0)       52.8 (38.3)     387.4 (633.6)
ER(500,0.1)    3 (0)         3.4 (3.3)   16.9 (9.6)      34.9 (36)
ER(500,0.2)    2 (0)         2 (0)       430.3 (68.5)    20840.9 (6164)
ER(1500,0.1)   2.43 (0.51)   2 (0)       105.4 (52.6)    1278.5 (983.3)

Distributed Execution
Models         InD           OutD        OutN            OutP
ER(150,0.1)    3.2 (0.42)    2 (0)       18.8 (3.12)     28.9 (18.3)
ER(150,0.2)    3 (0)         2 (0)       53.3 (25.6)     294.6 (298.9)
ER(500,0.1)    3 (0)         2.6 (1.1)   42.6 (49)       183.5 (496.3)
ER(500,0.2)    2 (0)         2 (0)       346.6 (110.3)   14519.9 (9588.9)
ER(1500,0.1)   2.25 (0.46)   2 (0)       104.6 (22.7)    1180.8 (645.3)

Table 2. Models (Erdős–Rényi). Best Fitness (Fit), Iterations (Iter), CPU Av. Time, Ratio between input and output vertices.

Centralized Execution
Models         Fit            Iter              Time                Ratio
ER(150,0.1)    18.5 (3.53)    88.6 (14.9)       43.47 (14.9)        8.10
ER(150,0.2)    52.8 (38.3)    267.6 (147.3)     263.8 (120.1)       2.84
ER(500,0.1)    16.9 (9.6)     64.74 (27.32)     295.83 (163.4)      29.6
ER(500,0.2)    430.3 (68.5)   46.09 (70.86)     910.37 (1580.29)    1.16
ER(1500,0.1)   105.4 (52.6)   408.43 (235.08)   1171.79 (1056.6)    14.23

Distributed Execution
Models         Fit             Iter             Time                Ratio
ER(150,0.1)    18.8 (3.12)     20.2 (5.2)       47.50 (21.7)        7.97
ER(150,0.2)    53.3 (25.6)     59.8 (32.4)      77.56 (32.13)       2.81
ER(500,0.1)    42.6 (49)       13.29 (10.57)    119.8 (233.65)      11.73
ER(500,0.2)    346.6 (110.3)   4.33 (6.40)      107.89 (522.38)     1.44
ER(1500,0.1)   104.6 (22.7)    112.5 (10.35)    234.51 (247.24)     14.34

– Average CPU user time per level, in seconds. In the case of a single processing unit, i.e., standard GA evolution, this quantity corresponds to the GA execution time. When the distributed process is considered, this time is averaged over the number of framework levels.
– Early stopping for no improvement. The number of consecutive generations without improvement in the best fitness value before the GA is stopped.
– Max number of generations. The maximum number of iterations to run before the GA search is halted.
– Final generation number. The iteration number associated with the obtained solution.
Note that, for the distributed evolution, this number corresponds to the iterations of the lowest site.

The following main considerations emerge from the results.

– In all models, except ER(500, 0.1), we obtained feasible 2-clubs (with null standard deviation).
– As for the considered networks: we are able to find combinatorial structures whose exact computation would require impractical time. In particular, this is interesting for those solutions obtained for a large input graph (e.g., networks with more than 1000 vertices).
– It is clear that, due to the complexity of the problem, we cannot compare the size of the obtained 2-club community with the size of an optimal solution. Therefore, to give a qualitative idea of the solutions, we reported the ratio between the numbers of vertices of the input and output graphs. Note that the approximation hardness of Maximum 2-club shows that the problem is not approximable within factor |V|^(1/2−ε), for each ε > 0 [3]. Thus, even in terms of approximability, the problem is very hard to approach. Observing our results, and in particular the ratios in Table 2, we find that the most compelling solutions are those for which, the larger the input graph (number of vertices), the smaller the difference between the ratio and unity.
– While the size of the inferred communities does not seem to differ, the (average) number of iterations of the last-level site in the distributed case is much lower than the number of iterations reported by the standard, centralized evolution. This is also evident from the decrease in the average execution time per level reported in the distributed case. Since the GA uses the same parameters for the standard and the distributed evolution, this behavior can be traced back to the initial suggestions that each lower-level site receives from the higher level.
– The computational cost seems to depend on the number of edges (i.e., on the expected connectivity of the random models ER(n, p)) as well as on the number of input vertices. As discussed for the fitness in Section 3.3, this can easily be justified by the computational cost of the diameter, which is known to be bounded by O(|V|^3).

6 Conclusions

While GA-based optimization is not new in the literature, new designs of these models are needed to cope with the hardness of many computational problems, which find new applications in many contexts [13,14]. In this paper we focused on the optimization of a social network of patients who aim to deepen their knowledge of the available care centers for their pathologies through the help of other "experienced" patients. We considered this problem from a computational point of view by defining a variant of the Maximum 2-club problem. Due to the computational complexity of the defined formulation, we provided GA-based heuristics in a distributed environment. The main advantages of this approach can be shortly summarized as follows.

– Similarly to the MapReduce programming paradigm, the genetic cascade model described in Section 4.1 abstracts the computational complexity of learning by allowing programmers to describe the learning process in terms of Training and Combine functions.
– Every site (as described in Section 4.1) provides local learning over the considered input graph. In fact, the best-fitted chromosomes generated by each GA site represent local sub-graph solutions. In this way, learning can be addressed to obtain feasible communities (i.e., 2-clubs) for specific issues: e.g., locally optimized sub-networks characterized by particular (types of) patients or care centers.
– Large-scale data optimization problems can be divided into independent, smaller optimization tasks. In this case, only local solutions are distributed downwards through the system to create a more accurate and general site-level computation.
– The proposed distributed learning environment is able to decrease the standard GA evolution time thanks to a series of appropriate initial suggestions, organized in a hierarchical manner.
– Finally, by considering a different learning algorithm, we can easily generalize the approach, simply by properly coding the Training function provided by the framework.

It is clear that, as in any problem-solving paradigm, the question of whether the considered approach reaches optimal solutions cannot be answered unconditionally in the positive. In fact, while attempts to establish a theoretical account of GA operations have proved difficult in the literature [33,12,25,2], the correctness of the solution is one desirable issue, yet not the only crucial one (another is, for example, the time needed to reach correctness).

Practice indicates that self-adaptation is a powerful tool to provide correct solutions: new chromosomes (i.e., solution representations) are generated by properly mutating and/or recombining old solutions, and the best-evaluated candidates (i.e., chromosomes) can survive (elitism) in a new population which, in turn, is immediately re-evaluated through the fitness function. Similarly, the number of re-evaluations is one of the most commonly used practical measures for assessing whether "efficiency" is (practically) achieved at reasonable cost. The approach given in this paper follows this standard practice: while keeping the search operators tractable, i.e., mutation and crossover (as detailed in Sec. 3.3), the fitness (re)evaluation promotes correct populations by guaranteeing, in our case, feasible diameter values (as discussed in Sec. 3.1).

Many different extensions can be provided for this work.
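Before turning to extensions, the feasibility-promoting fitness just described can be sketched as follows. The bit-vector encoding, the penalty constant, and the pairwise feasibility test are our illustrative assumptions, not the paper's exact implementation; the pairwise test (adjacent or sharing a selected common neighbor) is, however, equivalent to requiring induced diameter at most 2.

```python
def feasible_2club(adj, selected):
    """Induced diameter <= 2 iff every pair of selected vertices is
    adjacent or shares a common neighbor inside the selection."""
    sel = set(selected)
    # Closed neighborhoods restricted to the selection.
    nbrs = {v: (set(adj[v]) & sel) | {v} for v in sel}
    return all(u in nbrs[v] or (nbrs[u] & nbrs[v]) - {u, v}
               for u in sel for v in sel if u < v)

def fitness(adj, chromosome, penalty=10**4):
    """Reward large vertex sets; heavily penalize infeasible ones, so
    that under elitism feasible 2-clubs dominate the population."""
    selected = [v for v, bit in enumerate(chromosome) if bit]
    size = len(selected)
    return size if feasible_2club(adj, selected) else size - penalty
```

In a maximizing GA, any feasible chromosome thus scores above every infeasible one, and re-evaluation after mutation and crossover steers the surviving population toward feasible 2-clubs of growing size.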
First of all, while a detailed comparison between the presented approach and other comparable alternatives would be important for evaluating the practical conclusions of our paper, to the best of our knowledge there are no state-of-the-art works to consider for such a critical analysis. This question should be examined in detail in future (literature) research. Moreover, the distributed architecture reported in Section 4.1 was mainly introduced to help the combination and flow of information from higher-level sites to lower-level ones. In this paper, to speed up the experiments on laptops, we used only a reduced number of levels and sites. Although this architecture is easily scalable to many different levels and sites, further architectures can similarly be applied. This is another interesting issue for future research in this context. Finally, it is important to emphasize that the work described in this paper has to be considered, as discussed in Section 1, a means to facilitate and promote patient engagement. It is absolutely not intended as a constraint on the spontaneous nature of the communication and interaction activity which characterizes the freedom of a social network.

References

1. R. D. Alba. A graph-theoretic definition of a sociometric clique. Journal of Mathematical Sociology, 3:113–126, 1973.
2. Lee Altenberg. The schema theorem and Price's theorem. Foundations of Genetic Algorithms, 3:23–49, 1995.
3. Yuichi Asahiro, Eiji Miyano, and Kazuaki Samizo. Approximating maximum diameter-bounded subgraphs. In LATIN 2010: Theoretical Informatics, 9th Latin American Symposium, Oaxaca, Mexico, April 19-23, 2010, Proceedings, pages 615–626, 2010.
4. Gary D. Bader and Christopher W. V. Hogue. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics, 4:2, 2003.
5. Balabhaskar Balasundaram, Sergiy Butenko, and Svyatoslav Trukhanov.
Novel approaches for analyzing biological networks. J. Comb. Optim., 10(1):23–39, 2005.
6. Shumeet Baluja and Rich Caruana. Removing the genetics from the standard genetic algorithm. In Proc. of the 12th Int. Conf. on Machine Learning, Tahoe City, California, USA, pages 38–46, 1995.
7. Christopher E. Beaudoin and Chen-Chao Tao. Modeling the impact of online cancer resources on supporters of cancer patients. New Media & Society, 10(2):321–344, 2008.
8. B. Bollobás. Random Graphs. Cambridge University Press, 2001.
9. Jean-Marie Bourjolly, Gilbert Laporte, and Gilles Pesant. An exact algorithm for the maximum k-club problem in an undirected graph. European Journal of Operational Research, 138(1):21–28, 2002.
10. Maw-Shang Chang, Ling-Ju Hung, Chih-Ren Lin, and Ping-Chen Su. Finding large k-clubs in undirected graphs. Computing, 95(9):739–758, 2013.
11. Enrico Coiera. Social networks, social media, and social diseases. BMJ: British Medical Journal (Online), 346, 2013.
12. Pierre Collet and Jean-Philippe Rennard. Stochastic optimization algorithms. arXiv preprint arXiv:0704.3780, 2007.
13. R. Dondi, G. Mauri, and I. Zoppis. Orthology correction for gene tree reconstruction: Theoretical and experimental results. Procedia Computer Science, 108:1115–1124, 2017.
14. Riccardo Dondi, Giancarlo Mauri, and Italo Zoppis. Clique editing to support case versus control discrimination. In Intelligent Decision Technologies 2016, pages 27–36. Springer, 2016.
15. Petr A. Golovach, Pinar Heggernes, Dieter Kratsch, and Arash Rafiey. Finding clubs in graph classes. Discrete Applied Mathematics, 174:57–65, 2014.
16. Sepp Hartung, Christian Komusiewicz, and André Nichterlein. Parameterized algorithmics and computational experiments for finding 2-clubs. J. Graph Algorithms Appl., 19(1):155–190, 2015.
17. Christian Komusiewicz. Multivariate algorithmics for finding cohesive subnetworks. Algorithms, 9(1):21, 2016.
18. Christian Komusiewicz and Manuel Sorge.
An algorithmic framework for fixed-cardinality optimization in sparse graphs applied to dense subgraph problems. Discrete Applied Mathematics, 193:145–161, 2015.
19. Steven Laan, Maarten Marx, and Robert J. Mokken. Close communities in social networks: boroughs and 2-clubs. Social Netw. Analys. Mining, 6(1):20:1–20:16, 2016.
20. Melanie Mitchell. An Introduction to Genetic Algorithms. Complex Adaptive Systems. MIT Press, Cambridge (Mass.), 1996.
21. Robert Mokken. Cliques, clubs and clans. Quality & Quantity: International Journal of Methodology, 13(2):161–173, 1979.
22. Robert J. Mokken, Eelke M. Heemskerk, and Steven Laan. Close communication and 2-clubs in corporate networks: Europe 2010. Social Netw. Analys. Mining, 6(1):40:1–40:19, 2016.
23. Srinivas Pasupuleti. Detection of protein complexes in protein interaction networks using n-clubs. In Evol. Comput., Machine Learning and Data Mining in Bioinf., 6th EvoBIO 2008, Naples, Italy, March 26-28, 2008, pages 153–164, 2008.
24. Caroline R. Richardson, Lorraine R. Buis, Adrienne W. Janney, David E. Goodrich, Ananda Sen, Michael L. Hess, Kathleen S. Mehari, Laurie A. Fortlage, Paul J. Resnick, Brian J. Zikmund-Fisher, et al. An online community improves adherence in an internet-mediated walking program. Part 1: results of a randomized controlled trial. J. of Med. Internet Res., 12(4), 2010.
25. Günter Rudolph. Convergence Properties of Evolutionary Algorithms. Kovac, 1997.
26. Martín Safe, Jessica Carballido, Ignacio Ponzoni, and Nélida Brignole. On stopping criteria for genetic algorithms. Advances in Artificial Intelligence – SBIA 2004, pages 405–413, 2004.
27. Eugenio Santoro, Gianluca Castelnuovo, Italo Zoppis, Giancarlo Mauri, and Francesco Sicurello. Social media and mobile applications in chronic disease prevention and management. Frontiers in Psychology, 6, 2015.
28. Alexander Schäfer, Christian Komusiewicz, Hannes Moser, and Rolf Niedermeier.
Parameterized computational complexity of finding small-diameter subgraphs. Optimization Letters, 6(5):883–891, 2012.
29. L. Scrucca. GA: A package for genetic algorithms in R. Journal of Statistical Software, 53(4):1–37, 2013.
30. Roded Sharan and Ron Shamir. CLICK: A clustering algorithm with applications to gene expression analysis. In Proc. of the Eighth Int. Conf. on Intelligent Syst. for Molecular Biol., August 19-23, 2000, La Jolla / San Diego, CA, USA, pages 307–316, 2000.
31. V. Spirin and L. A. Mirny. Protein complexes and functional modules in molecular networks. Proc. of the Nat. Academy of Sciences, 100:12123–12128, 2003.
32. Atsushi Tsuya, Yuya Sugawara, Atsushi Tanaka, and Hiroto Narimatsu. Do cancer patients tweet? Examining the Twitter use of cancer patients in Japan. J. of Med. Internet Research, 16(5), 2014.
33. Thomas Weise, Michael Zapf, Raymond Chiong, and Antonio J. Nebro. Why is optimization difficult? In Nature-Inspired Algorithms for Optimisation, pages 1–50. Springer, 2009.
34. Italo Zoppis, Giancarlo Mauri, Francesco Sicurello, Eugenio Santoro, Giada Pietrabissa, and Gianluca Castelnuovo. Diabesity: A study for mhealth integrated solutions. In International Conference on Wireless Mobile Communication and Healthcare, pages 195–199. Springer, 2016.