Madeline Hundley and Roman Yampolskiy MAICS 2017 pp. 77–84 Shortest Total Path Length Spanning Tree via Wisdom of Artificial Crowds Algorithm Madeline V. Hundley, Roman V. Yampolskiy The wisdom of crowds [7] describes the phenomenon where Abstract—This paper presents a hybrid genetic algorithm (GA) aggregated results from a population, or “crowd,” approach the with Wisdom of Artificial Crowds (WoAC) approach to solving an best answer to a problem, even though individual answers may NP-hard problem. This is a novel approach to solving Shortest have significant differences and deviations from the globally total-Path-length Spanning Tree (SPST) problems. In our tests this approach achieved results up to 12% better than use of the optimum solution. The Wisdom of Artificial Crowds makes genetic algorithm alone. use of the artificially generated populations from the genetic algorithm, treating these solutions as the “crowd” whose Index Terms—genetic algorithm, shortest total path length expertise can be consulted. One child per population is created spanning tree, wisdom of artificial crowds by aggregating the edge choices of every SPST solution, probabilistically selecting a number of those edges to base the I. INTRODUCTION child on, then building the remainder of the spanning tree T HIS project addresses the NP-complete Shortest total- Path-length Spanning Tree (SPST) problem, proposed and discussed in [1-3]. For an edge-weighted, undirected graph, the solution via locally greedy edge choices. The program is written in Python 3.5, the graphics are objective of this problem is to find the spanning tree for which generated with the module pygame, and the graphs are the total sum of total edge weights on the paths between every generated with the matplotlib module. The input provided to pair of vertices is minimized. NP-complete problems are those the program is a data file with the vertices’ coordinates. The for whom no fast solution is known; as the size of the problem program must read in these coordinates, calculate the distance set grows the time any currently known algorithm requires to between each pair of vertices that are connected, generate a solve it increases incredibly quickly. SPST has been random initial population of spanning trees, then apply the demonstrated as one such NP-complete problem [4, 5]. genetic algorithm with wisdom of artificial crowds to generationally approach the best SPST for the graph provided. In this implementation of the problem we consider the distance between each vertex as the weight or cost of the edge. The motivation behind developing this algorithm was to Specifically, we consider the Euclidean distances between explore the interplay of the genetic algorithm component and vertices with Cartesian coordinates, calculated with the formula wisdom of artificial crowds component. We also sought to below. Therefore, the total distance of a path is the sum of the explore whether wisdom of artificial crowds is a viable, costs of each edge between each vertex in that path. The total- successful variant to the genetic algorithm for the SPST path-length of a spanning tree is the sum of the path length problem as well as whether a genetic algorithm is a viable, between every pair of vertices in the tree. successful source of good solutions for a wisdom of artificial 𝑑 = √(𝑥2 − 𝑥1 )2 + (𝑦2 − 𝑦1 )2 (1) crowds component. Genetic algorithms mimic the real-world phenomena II. PRIOR WORK whereby populations of biological species reproduce children The approach of a hybrid GA plus WoAC has been shown based on the genes of the parents and natural selection steers many times as a viable approach to solving many NP-complete each generation towards superior genes [6]. The genetic problems [8-11], including the famous Travelling Salesperson algorithm we incorporated into our hybrid approach initializes problem [12]. Baseline genetic algorithms have been applied to a random population of SPST solutions then reproduces solve spanning tree problems similar to the SPST problem [13] successive generations of solutions. It analyzes the reproductive and other approximation algorithms have been applied to the fitness of each member of a population, probabilistically selects SPST problem [14-16]. However, a hybrid GA + WoAC stronger parents to reproduce, generates child solutions from algorithm is, to our knowledge, a novel approach to the SPST problem. the genes of the parent spanning trees, and sets up the next generation’s population with the strongest members from the III. PROPOSED APPROACH old population plus qualifying members from the child solutions. To solve this SPST problem, the program uses a genetic algorithm with a wisdom of artificial crowds component. The Copyright held by the authors. 77 Shortest Total Path Length Spanning Tree via Wisdom of Artificial Crowds Algorithm pp. 77–84 2 program’s first order of business is to read in the coordinates of ever required. each vertex from a file and calculate the Euclidean distances between, saving these in a dictionary for quick lookup. Its last Once all paths have been identified, the SpanningTree class order of business is to plot the best, average, and worst results function calc_length is called to sum the total lengths of these from each generation of the algorithm and to graphically paths. The calc_length function simply needs to issue lookups display the best spanning tree found, with the shortest total- to the dictionary of distances between every pair of vertices in path-length. the graph generated at the beginning of the program. The function looks up and sums the length of each edge in every Overall, the genetic algorithm drives the development of a path. The total-path-length of the SpanningTree is stored in its diverse, increasingly fit population every generation. The length instance variable. wisdom of the crowds aggregates the collective knowledge of these populations to discover better or at least comparable but Finally, the SpanningTree class has a mutate function, which diverse paths. will be discussed further in the context of the genetic algorithm, below. A. SpanningTree Class B. Genetic Algorithm This program is based on a SpanningTree class. Each To begin, the genetic algorithm sets up its initial population instance of the class has an instance variable sptree, a set of the with the get_init_pop function by generating spanning trees edges, stored as tuples, in the spanning tree. To initialize a randomly, calling the get_rand_sptree function. This function SpanningTree the class should be provided a set of unique generates a valid spanning tree by tracking the vertices that have tuples, meaning no duplicates e.g. (1,2) and (2,1). Internally the been added so far and randomly selecting an edge from one sptree variable generates and stores both of these tuples, such vertex to one not yet added. This methodology prevents it however, for easy lookups to the edges contained in this from adding an edge “backwards” and introducing a cycle. SpanningTree instance. Each successive population in the genetic algorithm is Upon initializing a new instance of a SpanningTree, the class generated by the get_next_pop function, which first produces function calc_paths calculates the path for every pair of vertices one child based on the “wisdom of artificial crowds,” discussed and stores it in the instance variable paths. It does this by using further in the next section. Then using calls to the reproduce a recursive algorithm, the recursive_expansion function. function, it reproduces all remaining children required Starting with the first vertex, the function “expands” this vertex, “genetically” from selected parents. Reproduce uses the creating a “frontier” of the vertices that can next be reached get_fitness function to judge the fitness of each potential parent given the edges constituting the spanning tree. For example, if in the current population. Better spanning trees – those with the only edges in the spanning tree that contain the vertex ‘1’ shorter total-path-lengths – are judged to have a higher are the edges (1,2) and (1,3), expanding the vertex ‘1’ would reproductive fitness and therefore have a greater probability of yield a frontier of ‘2’ and ‘3’. It then arbitrarily chooses one of being selected as parents for the next generation. When two these vertices, adds it to the current path being traced, and with trees are selected to be parents, the crossover function generates another call to recursive_expansion expands again on this new two new children from those parents’ “genes.” Every time a vertex. Important to cultivating the frontier is eliminating the child is produced there is a possibility that a mutation will occur vertices already in the current path so the algorithm does not try in its “genes,” handled by the SpanningTree class’s built-in to expand “backwards.” If the algorithm reaches a vertex whose mutate function. frontier is empty, meaning the path has hit a dead end, that vertex is taken out of the path and the next option from the In the get_fitness function, the fitness ratios of every frontier at that level is selected and expanded. In this way, the spanning tree in the population need to reflect how short their recursion searches depth-first, backing up from dead-ends and total-path-lengths are. The typical fitness ratio in a genetic expanding each vertex until the destination vertex has been algorithm tries to achieve selection of a maximum value; this found. The algorithm can stop searching upon first will simply take each element’s value out of the sum total of the encountering the destination vertex because of an important population’s values. If applied to the total-path-lengths this feature of spanning trees: there can only be one valid path would “reward” the long paths rather than the short ones. For between any two vertices. Valid spanning trees by definition do the SPST problem this function must instead seek to achieve not have cycles; if a tree contained a cycle this would offer more selection of minimum values. Therefore, this algorithm’s than one path from one vertex to another. Therefore, once the fitness ratio takes each tree’s length and subtracts it from the algorithm found a valid path it could stop searching as this was highest value in the population and from 1, before dividing by the only valid path. Finally, these paths are stored in the the sum total of these adjusted costs. Finally, it scales so these instance variable paths, a dictionary whose keys are pairs of probabilities all add up to 1, a requirement for a valid vertices, stored as tuples, and whose values are lists containing probability distribution. This method more heavily weights the each path. Only unique tuples, meaning no duplicates, are shortest total-path-lengths. It also has the side effect of stored in this dictionary as lookups in one direction are only assigning the very worst spanning tree a fitness ratio of zero, 78 Madeline Hundley and Roman Yampolskiy MAICS 2017 pp. 77–84 3 which will cull it from potential selection in the next part of the shorter total-path-length for the spanning tree. There are some algorithm. This is a perfectly acceptable, even desirable, side instances where including a longer edge instead leads to overall effect from this process. better paths. However, these locally optimal choices demonstrated through testing to often be globally optimal as The crossover function uses an experimentally-determined well. In addition, the fact the child is based in portion on a break point in each parent’s “genes,” discussed further in the parent and not wholly on local greedy decisions helps to next section, to determine the edges that will be passed on to compensate and works to retain knowledge of which long the children. One child is first set up with a portion of edges routes are favorable. taken from the “mom” parent tree and the second child is first set up with a portion of edges taken from the “dad” parent tree. After crossover has completed both kids, they are each Each of these children, currently just an incomplete set of edges, checked for mutations using the SpanningTree class’s built-in is then passed to the complete_tree function, along with the mutate function. If the test against the probability of mutation opposite parent from which it was first based on. The occurring hits, the function randomly selects a path that complete_tree function, as the name implies, completes the tree involves two edges (three vertices), discards the first edge and using only edges available in the parent provided. In this way, adds a replacement connecting the first edge straight to the the first child is first built using “mom” then completed with third. This approach allows for a test of whether going straight “dad,” and vice versa for the second child. from the first to the third vertex, rather than through the second vertex, offers better paths for the tree in terms of minimizing The complete_tree function is integral to both the genetic total-path-length. Because the selection of which two-edge path algorithm and the wisdom of artificial crowds component, to mutate is random, it is equally likely to affect “central” whose use of it is discussed in further detail in the next section. vertices with many connections and “leaf” vertices with just This function first identifies all groups that exist in the edges one. This is an advantageous approach because either situation provided. For example, if (1,2) and (2,3) are both edges in the has the potential to benefit from such a mutation, so any path set provided, the function identifies that {1,2,3} is a connected should be eligible for selection. Finally, this approach is group of edges. Then vertices that are not yet connected, that guaranteed never to introduce a cycle, maintaining the integrity have zero edges, are added as groups of one each by themselves. of the mutating spanning tree. The function’s next step is to choose and add edges which will connect these groups into a complete spanning tree. Starting at The get_next_pop function has a predetermined number of an arbitrary group called the coregroup, the function generates children it is required to generate. Children that are duplicates a list of valid edges it may choose from: all edges from a vertex of trees that are in the old population or children already in the current group to a vertex not in the group and only edges generated are not permitted, however. This is a useful feature to that are also present in the parent tree. It is important to point ensure that genetic diversity is maintained in the population. If out that only edges from an added vertex to an unadded vertex all viable parents have been used to reproduce but there aren’t are those allowed. As with the process for finding the paths in enough new children yet, random trees are generated to be used a tree or for building a random tree, selecting edges in this way as children. This option was very rarely needed, if ever, but is ensures the tree is never grown “backwards” i.e. that it is never an important feature nonetheless. In addition, the crossover rate allowed to introduce a cycle. After identifying the valid edges, is set very high, but approximately 5% of the time two the function then greedily selects the shortest one, adding that randomly-generated children are returned from reproduction. edge to the tree. It must also identify the group that vertex Again, this infuses some genetic diversity in to the population. belongs to and incorporate all that group’s vertices into the After the required number of children have been generated, the coregroup. This process is repeated until all groups are get_next_pop function replaces up to a set number of the worst incorporated, growing the coregroup across the graph until it solutions in the old population with children, but only if the includes all vertices and the child spanning tree is complete. children are superior to those being discarded. In addition to purposefully driving successively better results each There are a couple important items to note regarding the generation, it also mimics the real-world phenomenon where complete_tree function. First, the arbitrary selection of the weak members of an animal species do not survive infancy. In initial coregroup does not affect the final spanning tree. It does this way, it is an improvement both to the function and the spirit not matter in what order the groups are added to the coregroup of this genetic algorithm. because of the greedy nature of the edge selection and the fact any vertex not currently in the coregroup is allowed as the next C. Wisdom of Artificial Crowds addition. For instance, if vertex ‘1’ is in one group, vertex ‘2’ is in another, and their edge is the shortest available to join those In every population, the “Wisdom of Artificial Crowds” is two groups, it will always be the one selected whether vertex applied to generate one child. Every occurrence of each edge in ‘2’s group is incorporated first, somewhere in the middle, or the population is tallied and its percentage of the total edges is calculated. These percentages serve as the probabilities the last. Second, the greedy decision to add the shortest edge get_crowd_wisdom function uses when selecting edges the available from the parent was based on the assumption that child will contain. The more often a particular edge occurs these shortest edges available will contribute to an overall 79 Shortest Total Path Length Spanning Tree via Wisdom of Artificial Crowds Algorithm pp. 77–84 4 throughout the population, the higher the probability it will be IV. EXPERIMENTAL RESULTS selected to be part of the child. Edges that never occur in the population have a zero probability of being selected for A. Data inclusion in the child, a desirable effect in the algorithm. The data used was generated by the Concorde TSP Solver [17]. While designed to solve Travelling Salesperson Problems, Some wisdom-of-the-crowd implementations use only a it is also suitable for generating datasets for a variety of similar portion of a population to study, choosing to consult only problems. Using the Windows GUI, we could randomly answers it deems as “experts.” In our approach, we purposefully generate graphs of any number of vertices, which are saved in chose to use the entire population and consider every spanning a standardized TSP format listing the vertices numbered 1 to n tree an “expert” worth consulting. First, this is because we with corresponding coordinates. The program reads in this file wanted a comprehensive view of the population’s aggregate line by line and extracts the coordinates. Below is an example knowledge, not a narrow one. Second, it is because every for 10 vertices: spanning tree in that population has been carefully built and selected from previous populations, meaning it will contain NAME: concorde10 knowledge worth considering. Because successive populations TYPE: TSP do not allow for bad spanning trees i.e. those worse than their COMMENT: Generated by CCutil_writetsplib predecessors to ever be included, every tree in the “crowd” is COMMENT: Write called for by Concorde GUI good enough to lend its input. DIMENSION: 10 EDGE_WEIGHT_TYPE: EUC_2D The child is initialized with these selected edges and then NODE_COORD_SECTION complete_tree is called to fill out the rest of the edges. The 1 22.549020 89.029536 parent passed to complete_tree in this case is a dummy 2 23.039216 81.434599 SpanningTree initialized with every vertex pair possible for the 3 30.392157 79.324895 graph. In this way, when complete_tree is greedily selecting the 4 40.277778 80.379747 best edges to join the disparate groups of edges within the child 5 38.071895 60.759494 spanning tree, every possible connection is available as an 6 23.774510 59.704641 option. 7 25.245098 67.721519 8 30.065359 66.244726 There is an important logic check in complete_tree that does 9 36.029412 70.886076 not come into play during its use by the genetic algorithm but 10 49.264706 71.940928 plays a key role during its use by the wisdom-of-artificial- crowds component. This check is part of the initial process of sorting the child’s edges into their groups. If a group containing only two vertices, therefore representing an edge, is detected to B. Results have not just one but both of its vertices already in a group, it is The genetic algorithm alone yielded the results shown in discarded. By definition if two vertices are already in a group Table 1, all from 30 generations per execution of the together, an edge between them would introduce a cycle. Such algorithm. Figures 1 and 2 show example outputs of these situations may occur because get_crowd_wisdom selects edges results. Figures 1A and 2A graph the worst, average, and best only based on probabilities. This means it may select edges that, solutions per generation in red, blue, and green, respectively. if all included, create a cycle e.g. (1,2), (2,3), (1,3). This check Figures 1B and 2B display the resulting best spanning tree in complete_tree vets the edges of the child it has been found at the completion of the algorithm. provided. For example, the function will begin by building a group from (1,2) and (2,3) of {1,2,3}, then detect that (1,3) is Vertices Population Runtime (s) Best Length invalid and drop it. This situation does not occur during the 30 50 37.56215 33174.15536 genetic algorithm because the child passed to complete_tree in 30 50 42.87745 33454.12696 30 50 38.18318 37393.45269 that instance is based on a parent tree which is guaranteed to be 50 50 177.57016 115716.17169 valid (unless it was somehow otherwise corrupted). Therefore, 50 50 235.58247 119350.27524 if the parent does not contain any cycles then the subset of the 50 50 188.93781 125145.69157 parent used to build the child is also guaranteed to be cycle-free. 77 77 1212.46735 277132.34661 77 77 1380.45296 286723.39976 The greedy decisions of complete_tree drive the 77 77 1442.40950 271129.73645 development of optimal spanning trees. The inclusion of every Table 1: Results from Genetic Algorithm path possible in the parent in the case of the wisdom-of- artificial-crowd child helps contribute both to optimal spanning trees and a diverse population because it may include a new edge not seen anywhere else in the population but that turns out to be highly optimal. 80 Madeline Hundley and Roman Yampolskiy MAICS 2017 pp. 77–84 5 Figure 1A: Worst, Average, and Best Solutions per Generation Figure 2B: Spanning Tree of Total-Path-Length 277132.34661 The full hybrid algorithm yielded the results in Table 2, also from 30 generations per execution of the algorithm. Figures 3 and 4 show example output from the hybrid algorithm. Figures 3A and 4A graph the worst, average, and best solutions per generation in red, blue, and green, respectively. Figures 3B and 4B display the resulting best spanning tree found at the completion of the algorithm. Figure 1B: Spanning Tree of Total-Path-Length 125145.69157 Figure 2A: Worst, Average, and Best Solutions per Generation 81 Shortest Total Path Length Spanning Tree via Wisdom of Artificial Crowds Algorithm pp. 77–84 6 Vertices Population Runtime (s) Best Length 30 50 37.52615 32912.40819 30 50 39.67027 33800.07715 30 50 35.05000 34155.32438 50 50 215.91535 103152.67340 50 50 268.56736 106292.14450 50 50 226.07593 107272.21349 77 77 1723.61959 253756.81567 77 77 1423.14640 257199.90750 77 77 1375.86069 262635.91412 Table 2: Results from Hybrid Algorithm Figure 4B: Spanning Tree of Total-Path-Length 253756.81567 For 30 vertices and a population of 50, the hybrid algorithm required 5.38% less time and produced results of an average 3.03% shorter final total-path-length over the genetic algorithm alone. For 50 vertices and a population of 50, the hybrid algorithm required 18.02% more time and produced results an average 12.07% shorter than the genetic algorithm alone. For 77 vertices and a population of 77, the hybrid algorithm required 12.08% more time to execute and produced results an average 7.35% shorter than the hybrid algorithm alone. This hybrid approach was capable of finding good solutions Figure 3A: Worst, Average, and Best Solutions per Generation to larger datasets, as shown here in Table 3, though with long runtimes. Figures 5 and 6 show example outputs of these results. Figures 5A and 6A graph the worst, average, and best solutions per generation in red, blue, and green, respectively. Figures 5B and 6B display the resulting best spanning tree found at the completion of the algorithm. Ver- Popu- Gene- Runtime (s) Best Length tices lation rations 80 80 30 1976.99008 271001.70025 80 40 30 853.11680 289119.11059 100 100 20 3228.44966 507786.11792 100 20 20 702.63019 663055.61606 222 100 10 38360.50510 3912193.54795 222 20 10 8892.24261 5096472.87880 Figure 3B: Spanning Tree of Total-Path-Length 106292.14450 Table 3: Results from Hybrid Algorithm on Large Datasets Figure 4A: Worst, Average, and Best Solutions per Generation Figure 5A: Worst, Average, and Best Solutions per Generation 82 Madeline Hundley and Roman Yampolskiy MAICS 2017 pp. 77–84 7 the crowds, the resultant child would in turn drive the diversity and fitness of the genetic algorithm’s population. Therefore, the hybrid algorithm as a whole developed good results. Finally, it is also important to highlight that while not optimal, this algorithm is relatively fast. Some algorithms are incapable of finding any answer for datasets of the size tested here. Overall, as a problem-solving technique, this combined algorithm has many advantages in that it is relatively efficient, complete, and finds a good answer to the Shortest Total-Path- Length Problem for large datasets. Figure 5B: Spanning Tree of Total-Path-Length 507786.11792 REFERENCES [1] T. C. Hu, "Optimum communication spanning trees," SIAM Journal on Computing, vol. 3, no. 3, pp. 188- 195, 1974. [2] D. S. Johnson, J. K. Lenstra, and A. Kan, "The complexity of the network design problem," Networks, vol. 8, no. 4, pp. 279-285, 1978. [3] P. M. Pardalos, D. W. Hearn, and W. W. Hager, Network optimization. Springer Science & Business Media, 2012. [4] M. R. Gary and D. S. Johnson, "Computers and Intractability: A Guide to the Theory of NP- completeness," ed: WH Freeman and Company, New York, 1979. [5] K. Mehlhorn, "Selected Topics in Algorithms 2009 Figure 6A: Worst, Average, and Best Solutions per Generation NP-completeness of Minimum Fundamental Cycle Basis Problem," 2009. [6] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to algorithms. MIT press Cambridge, 2001. [7] J. Surowiecki, The wisdom of crowds. Anchor, 2005. [8] D. Bonomo, A. P. Lauf, and R. Yampolskiy, "A crossword puzzle generator using genetic algorithms with Wisdom of Artificial Crowds," in Computer Games: AI, Animation, Mobile, Multimedia, Educational and Serious Games (CGAMES), 2015, 2015, pp. 44-49: IEEE. [9] L. H. Ashby and R. V. Yampolskiy, "Genetic algorithm and Wisdom of Artificial Crowds algorithm applied to Light up," in 2011 16th International Figure 6B: Spanning Tree of Total-Path-Length 3912193.54795 Conference on Computer Games (CGAMES), 2011. [10] J. Redding, J. Schreiver, C. Shrum, A. Lauf, and R. Yampolskiy, "Solving NP-hard number matrix games V. CONCLUSIONS with Wisdom of Artificial Crowds," in Computer Games: AI, Animation, Mobile, Multimedia, This combination of genetic algorithm with wisdom of Educational and Serious Games (CGAMES), 2015, artificial crowds is complete, meaning it will always find an 2015, pp. 38-43: IEEE. answer. It is not guaranteed to be optimal, however this [11] A. C. Port and R. V. Yampolskiy, "Using a GA and algorithm is still successful and is capable of not only finding Wisdom of Artificial Crowds to solve solitaire some result but with good implementation can optimize to find battleship puzzles," in Computer Games (CGAMES), a better, if not best, result. 2012 17th International Conference on, 2012, pp. 25- 29: IEEE. The relationship of the genetic algorithm to the crowd- [12] R. V. Yampolskiy and A. El-Barkouky, "Wisdom of wisdom component proved to be advantageous. As the genetic artificial crowds algorithm for solving NP-hard algorithm developed artificial results to inform the wisdom of 83 Shortest Total Path Length Spanning Tree via Wisdom of Artificial Crowds Algorithm pp. 77–84 8 problems," International Journal of Bio-inspired computation, vol. 3, no. 6, pp. 358-369, 2011. [13] J. Fletcher, T. Fernando, H. Iu, M. Reynolds, and S. Fani, "A case study on optimizing an electrical distribution network using a genetic algorithm," in 2015 IEEE 24th International Symposium on Industrial Electronics (ISIE), 2015, pp. 20-25: IEEE. [14] B. Y. Wu, K. M. Chao, and C. Y. Tang, "Approximation algorithms for the shortest total path length spanning tree problem," Discrete applied mathematics, vol. 105, no. 1, pp. 273-289, 2000. [15] R. T. Wong, "Worst-case analysis of network design problem heuristics," SIAM Journal on Algebraic Discrete Methods, vol. 1, no. 1, pp. 51-63, 1980. [16] B. Y. Wu, G. Lancia, V. Bafna, K.-M. Chao, R. Ravi, and C. Y. Tang, "A polynomial-time approximation scheme for minimum routing cost spanning trees," SIAM Journal on Computing, vol. 29, no. 3, pp. 761- 778, 2000. [17] W. Cook, "Concorde TSP solver," See: http://www. tsp. gatech. edu/concorde. html, 2005. 84