=Paper=
{{Paper
|id=Vol-1444/paper1
|storemode=property
|title=Learning Rules for Cooperative Solving of Spatio-Temporal Problems
|pdfUrl=https://ceur-ws.org/Vol-1444/paper1.pdf
|volume=Vol-1444
|dblpUrl=https://dblp.org/rec/conf/ki/Apeldoorn15
}}
==Learning Rules for Cooperative Solving of Spatio-Temporal Problems==
Daan Apeldoorn
Information Engineering Group, Technische Universität Dortmund, Germany
daan.apeldoorn@tu-dortmund.de

Abstract. This paper addresses the issue of creating agents that are able to learn rules for cooperative problem solving behavior in different multi-agent scenarios. The proposed agents start with given rule fragments that are combined to rules determining their action selection. The agents learn how to apply these rules adequately by collecting rewards retrieved from the simulation environment of the scenarios. To evaluate the approach, the resulting agents are applied in two different example scenarios.

Keywords: multi-agent simulation, cooperative problem solving, rule learning, knowledge extraction

1 Introduction

Learning to solve problems cooperatively is an interesting and challenging task for agents in multi-agent scenarios, with possible applications in logistics, scheduling or robotics (e.g., [5]). This paper approaches the issue by introducing agents that learn to apply rules which are combined from given rule fragments. The agents are then evaluated in the context of different spatio-temporal problem solving scenarios that are defined using the multi-agent framework AbstractSwarm [1, 2], and the learned rules are extracted from the agents' epistemic state.

Section 2 introduces the agent model. Section 3 evaluates the agent model in the context of two different example scenarios and presents the results. A conclusion and an outlook on future work are given in Section 4.

2 Agent Model

The agent model is considered a template which is instantiated for every agent that is part of a multi-agent scenario. Thus, only homogeneous multi-agent scenarios are considered here, where every agent follows the same implementation. Agents have a collective epistemic state, but every agent infers its own actions, also considering its current percepts.

In AbstractSwarm, the two basic concepts agents and stations exist. The former refer to the active components of a scenario (being able to act) and the latter represent the passive components of a scenario (serving as locations of the agents). These concepts are used in the following as a basis for the agents' percepts and actions.

2.1 Percepts, Actions and Communication

Following the agent interface of the AbstractSwarm simulation framework, the percepts of an agent a consist of:

– a's current location (i.e., the station where a is currently located)
– the current location of all other agents (i.e., the stations where the other agents are currently located)
– the next action of every agent a_i ≠ a that was computed in the same time step before a (i.e., the station every agent a_i selected as its next target)¹

An agent's action is defined as the decision about its next target location, chosen from a given set of potential target locations. Before the action of an agent a is executed (i.e., after the agent decides about its next target location, but before the agent starts moving towards this new target location), the agent communicates its decision to all other agents a_j ≠ a that are computed subsequently to a. This information then is part of the percepts of the agents a_j (see the third point at the beginning of Section 2.1).

¹ Note that in the AbstractSwarm framework, in every discrete time step, all agents of a scenario are computed subsequently (in random order) and agents can consider the information communicated by other agents that were computed before.
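The sequential decision-and-communicate protocol of Section 2.1 can be illustrated with a minimal Python sketch. This is not the actual AbstractSwarm (Java) agent interface; the names Percepts, run_time_step and decide are hypothetical and only mirror the verbal description above.

```python
# Minimal sketch (not the actual AbstractSwarm interface): hypothetical data
# structures illustrating an agent's percepts and the sequential
# decide-and-communicate protocol within one discrete time step.
from dataclasses import dataclass
from typing import Dict, List, Optional
import random

@dataclass
class Percepts:
    own_location: Optional[str]                # station where the agent currently is
    other_locations: Dict[str, Optional[str]]  # other agent name -> current station
    announced_targets: Dict[str, str]          # agent name -> target chosen earlier in this step

def run_time_step(agent_names: List[str],
                  locations: Dict[str, Optional[str]],
                  decide) -> Dict[str, str]:
    """Computes all agents subsequently in random order; each agent perceives the
    targets already announced by the agents computed before it in this step."""
    announced: Dict[str, str] = {}
    for name in random.sample(agent_names, k=len(agent_names)):
        percepts = Percepts(
            own_location=locations[name],
            other_locations={a: s for a, s in locations.items() if a != name},
            announced_targets=dict(announced),
        )
        target = decide(name, percepts)   # the agent's action: choosing a target station
        announced[name] = target          # communicated to agents computed afterwards
    return announced
```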
2.2 Rules

The agents are provided with a set of rule fragments that can be combined to rules. The resulting rules are used by the agents to select their target locations. A rule fragment is either the subject or the strategy of a rule. A rule subject represents a property of a location (e.g., the current free space of the location) and a rule strategy is a selection criterion (e.g., whether a location should be selected according to the maximum or the minimum value of a given subject). A complete rule is of the form subject strategy, stating that locations will be selected based on the given subject and according to the given strategy. As an example, consider the rule FreeSpace Max: Following this rule, an agent would choose, among all potential target locations, the one with the maximum free space as next target.

Since locations can have a broad variety of different properties that could potentially serve as selection criteria, a subset of rule subjects must be selected here. The focus is set to subjects that seem to be relevant for the example scenarios considered in Section 3²:

– FreeSpace: The current available space of a location (e.g., the remaining number of car agents that are still able to enter a parking deck until it is completely covered)
– MaxSpace: The total available space of a location (e.g., the total number of parking boxes of a parking deck)
– VisitTime: The duration of visiting a location, excluding the time to get to the location (e.g., the time that will be needed to do a transaction at a bank counter)
– RemainingTime: The time left until a fully covered location becomes available again (e.g., the remaining time a bank counter will be occupied while a bank customer agent is finishing its transaction)

The following rule strategies are selected, which cover the most common cases:

– Min: Selection according to the minimal value of the rule subject
– Max: Selection according to the maximal value of the rule subject
– Avg: Selection according to the average value of the rule subject
– None: No selection according to the value of the rule subject (random selection regarding the rule subject)

In case the application of a rule results in the selection of more than one target location (i.e., the values of the rule subject are equal), multiple rules can be chained, meaning that the rules are applied subsequently. As an example, following the rule chain FreeSpace Max → VisitTime Min, agents will first select the locations with a maximum amount of free space (compared to all other potential target locations). If more than one location currently has the same amount of free space, the one with the minimum visiting time will be selected from these. If all rules are chained and there is still more than one location in the result set, one of the locations in the result set is chosen randomly. The order of the rule chains is learned based on the agents' experiences and is inferred from the agents' current epistemic state.

² Note that the selection of possible rules is also restricted by the limited number of properties that are available in the AbstractSwarm framework.
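The chaining just described can be made concrete with a small sketch. It assumes that each location exposes its subject values as a plain dictionary (possibly with missing entries) and that ties within the Avg strategy mean "closest to the average value"; these names and the tie-breaking details are assumptions for illustration, not part of the AbstractSwarm framework.

```python
# Minimal sketch of rule chaining, assuming locations expose their rule-subject
# values as dictionaries, e.g. {"FreeSpace": 3, "VisitTime": 2}.
import random
from typing import Dict, List, Tuple

Location = Dict[str, float]   # subject name -> value

def apply_rule(candidates: List[Location], subject: str, strategy: str) -> List[Location]:
    """Keeps the candidates that are optimal w.r.t. the given subject/strategy."""
    values = [loc.get(subject) for loc in candidates]
    if strategy == "None" or any(v is None for v in values):
        return candidates                      # subject missing or ignored: no restriction
    if strategy == "Min":
        best = min(values)
    elif strategy == "Max":
        best = max(values)
    elif strategy == "Avg":                    # interpreted as: closest to the average value
        avg = sum(values) / len(values)
        best = min(values, key=lambda v: abs(v - avg))
    else:
        return candidates
    return [loc for loc in candidates if loc.get(subject) == best]

def select_target(candidates: List[Location], rule_chain: List[Tuple[str, str]]) -> Location:
    """Applies the rules of the chain subsequently; remaining ties are broken randomly."""
    result = candidates
    for subject, strategy in rule_chain:
        result = apply_rule(result, subject, strategy)
        if len(result) == 1:
            return result[0]
    return random.choice(result)

# Example: the chain FreeSpace Max -> VisitTime Min from the text.
# select_target(locations, [("FreeSpace", "Max"), ("VisitTime", "Min")])
```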
The epistemic state together with the learning and the inference process will be explained in the following.

2.3 Epistemic State

The epistemic state of an agent comprises its knowledge about which rules (or rather rule chains) are most preferable. Since the usefulness of rules strongly depends on the problem to be solved (and in many cases there is no a priori knowledge about which rules could be useful), the agents learn their epistemic state by acting in the problem scenario and by earning rewards for their actions. The epistemic state is represented as a Bayesian Logic Network [3] which is trained using Statistical Relational Learning. By this, the learned rules can later be extracted easily from the epistemic state. In the following three subsections, the Bayesian Logic Network representing the epistemic state is introduced and both the learning and the inference process are described.

Bayesian Logic Network. A Bayesian Logic Network (BLN), introduced by Jain et al. in [3], is a Bayesian Network extended by (many-sorted) first-order logic. Functions and predicates (which are considered functions with the co-domain {True, False}) are associated with the nodes of an underlying Bayesian Network, such that every node represents a (conditional) probability distribution over the co-domain of the associated predicate. Logical formulas can be added, which will then be considered by the inference mechanism (in addition to the probability distributions of the network). A BLN can be trained such that the (conditional) probability distributions are learned from a given set of data. The BLN is implemented and visualized in the following using ProbCog [4], a software suite for Statistical Relational Learning.

Besides some logical formulas, the BLN for modeling the epistemic state of the agents consists of only three nodes:

– The node selectedValue(sit) represents the function with the corresponding probability distribution for the rule subjects.
– The node applyRule(sit) represents the function with the corresponding conditional probability distribution for the rule strategies, given a rule subject. The associated function of the node is later queried for inference.
– The isolated node valueComplete(sit, val) represents a predicate which is only relevant for the logical formulas implemented in the BLN. (These formulas are used later to handle special cases where some of the potential target locations of an agent are lacking a property and therefore are incomparable regarding this subject; see the explanation of the logical formulas below.)

The functions associated with the nodes selectedValue(sit) and applyRule(sit) depend on the current situation of an agent (which is used to specify evident knowledge for the inference queries and for the training data later). The predicate valueComplete(sit, val) depends on the current situation of the agent and on the subject of a rule. Figure 1 shows the described BLN in the initial state.

Fig. 1. Initial BLN for the agents' epistemic state.

Note that the agents use a collective epistemic state (see Section 2), such that all agents access the same BLN (both for learning and inference). However, every agent a infers its own actions, which additionally depend on its current external state σ_t^a at time t in the environment.
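As a simplified stand-in for this collective epistemic state (not the actual ProbCog BLN, which additionally carries the logical formulas), the conditional distributions P(strategy | subject) can be pictured as shared count tables; the class and method names below are hypothetical.

```python
# Simplified stand-in for the collective epistemic state (not the actual
# ProbCog BLN): counts per (subject, strategy) pair, from which relative-
# frequency estimates of P(strategy | subject) are obtained. All agent
# instances would share one such object.
from collections import defaultdict
from typing import Dict

SUBJECTS = ["FreeSpace", "MaxSpace", "VisitTime", "RemainingTime"]
STRATEGIES = ["Min", "Max", "Avg", "None"]

class EpistemicState:
    def __init__(self) -> None:
        # counts[subject][strategy] -> number of logged training records
        self.counts: Dict[str, Dict[str, int]] = {s: defaultdict(int) for s in SUBJECTS}

    def log_record(self, subject: str, strategy: str) -> None:
        """Corresponds to one training data record (selectedValue/applyRule pair)."""
        self.counts[subject][strategy] += 1

    def p_strategy_given_subject(self, subject: str, strategy: str) -> float:
        """Relative-frequency estimate of P(strategy | subject)."""
        total = sum(self.counts[subject].values())
        if total == 0:
            return 1.0 / len(STRATEGIES)   # no data yet: assumed uniform (cf. Figure 1)
        return self.counts[subject][strategy] / total
```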
The logical rules implemented in the BLN handle the cases where some locations are lacking some of the properties represented by the rule subjects, which can lead to anomalies for certain subjects (e.g., a closed bank counter does not have a limited duration for visiting, thus applying the rule VisitTime Max would always lead to the selection of the closed bank counter). These kinds of problems are covered by the following formulas:

selectedValue(sit, VisitTime) ∧ ¬valueComplete(sit, VisitTime) ⇒ ¬applyRule(sit, Max)

selectedValue(sit, RemainingTime) ∧ ¬valueComplete(sit, RemainingTime) ⇒ ¬applyRule(sit, Max)

The first formula states that the rule VisitTime Max is never applied in case not all of the potential target locations have a defined duration for visiting. The second formula states this analogously for the time left until a fully covered location is available again.

Learning. While a simulation episode is running, the agents select their locations by trying out different rule subjects and strategies. After the execution of an action is completed by an agent a ∈ A (i.e., after the agent has visited a selected location), the agent gets a local reward r from the simulation environment, which is calculated as follows:

r := 1/(t_rwd − t_act + 1) · Σ_{t = t_act}^{t_rwd} (1/|A|) Σ_{a ∈ A} w(σ_t^a)    (1)

where t_act is the point in time when a selected a target location, t_rwd is the point in time when a finished visiting this location, σ_t^a describes the state of a at time t, and the function w is defined as:

w(σ_t^a) := 1, if a is visiting a location at time t; 0, otherwise    (2)

Thus, the reward for the action of agent a is calculated from the number of agents in the scenario that are visiting locations as a consequence of the action performed by a, averaged over time t_act to time t_rwd: Agents gain higher rewards the more their actions allow other agents to simultaneously perform their respective actions, and lower rewards the more they restrict other agents. An agent can easily calculate a global reward from the local rewards of its actions by summing up all local rewards until the end of a simulation episode.

Before a new simulation episode starts, the agents store their experiences from the previous episode (i.e., which rules were applied in which situations) by logging training datasets that consist of value assignments to the functions. The following example shows three exemplary training data records:

selectedValue(sit1) = FreeSpace    applyRule(sit1) = Max
selectedValue(sit2) = FreeSpace    applyRule(sit2) = Max
selectedValue(sit3) = VisitTime    applyRule(sit3) = Min

A situation identifier sit_i contains the name of the agent that was applying a rule and an index value (including the time when the rule was applied). In the given example, it was preferable in two cases to select a target location according to the rule FreeSpace Max and in only one case it was preferable to select a target location according to the rule VisitTime Min. The amount of records that are stored depends on the global reward earned by the agent during an episode: The higher the reward, the more training data records are stored. After storing the training data records, the (conditional) probability distributions of the BLN are updated by counting relative frequencies from the training data.
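The local reward of Eq. (1) can be computed directly from a log of which agents are visiting locations at which time steps. The sketch below assumes such a log is available as a boolean lookup is_visiting(agent, t); this helper and the function names are hypothetical, not part of the simulation framework.

```python
# Sketch of the local reward from Eqs. (1) and (2), assuming a lookup
# is_visiting(agent, t) that returns True iff the agent is visiting a
# location at time t (hypothetical helper).
from typing import Callable, List

def local_reward(agents: List[str], t_act: int, t_rwd: int,
                 is_visiting: Callable[[str, int], bool]) -> float:
    """Average fraction of agents visiting a location between t_act and t_rwd."""
    total = 0.0
    for t in range(t_act, t_rwd + 1):
        # (1/|A|) * sum over a in A of w(sigma_t^a)
        total += sum(1 for a in agents if is_visiting(a, t)) / len(agents)
    return total / (t_rwd - t_act + 1)

def global_reward(local_rewards: List[float]) -> float:
    """Sum of all local rewards earned during one simulation episode."""
    return sum(local_rewards)
```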
Inference. For inferring results from the trained BLN, the function associated with the node applyRule(sit) is queried for every rule subject (i.e., for every value of the co-domain of the function associated with the node selectedValue(sit)). The results are the conditional probabilities over the different rule strategies, given the subjects.

Based on the conditional probabilities retrieved from the BLN, when an agent determines its next target location, it performs the following steps:

1. The rule (i.e., the subject-strategy combination) with the overall highest probability value is applied to select a target location.
2. If this results in more than one location (i.e., the resulting locations are indistinguishable regarding the rule subject), the rule with the next lower probability value is applied to the results from the previous rule. This is continued until there is only one location left in the result set or until all rule subjects were used.

Thus, the conditional probabilities P(Str_1|Sub_1) > ... > P(Str_n|Sub_n) would lead to the rule chain Sub_1 Str_1 → ... → Sub_n Str_n. (If in this case there is still more than one location in the result set, one of the remaining locations is selected randomly.) By this, stronger rules that were reinforced through the learning process are preferred over weaker rules, and weaker rules are used with lower priority in a rule chain (in case not all stations could be distinguished by the stronger rules).

To explore the different combinations of rule fragments (even if the agents have already learned that some rules are preferable), an exploration probability determines in how many cases an agent decides to use another rule than the one that was inferred from its current epistemic state. The exploration probability depends on the number of already tried rules and the total amount of change in the conditional probability distribution. By this, the exploration probability is slightly discounted over time with increasing experience of the agent.

3 Evaluation

3.1 Test Scenarios

This section introduces the two example scenarios from [1], which are used here as test scenarios to evaluate the agent model. In both scenarios, agents have to solve a problem cooperatively. The scenarios are modeled using the AbstractSwarm framework.

Scenario 1: School Timetabling. In this scenario, a small fictive school is considered, where timetables have to be created for teachers, pupils and rooms. The school comprises:

– 2 math teachers, 2 English teachers and 1 music teacher
– 5 classes (every class has to get 2 math lessons, 2 English lessons and 1 music lesson)
– 4 rooms (3 normal class rooms, 1 special music room)

Teacher agents, class agents and course agents must organize themselves to efficiently create an optimized timetable for the school (i.e., which agent has to be at which time in which room). In this scenario, all locations have the same size (e.g., only one class can be located in a room at a point in time) and the duration of visiting is identical for all locations (i.e., all courses have the same length).

Scenario 2: Production Simulation. In this scenario, a small factory is considered, where workers are producing different products. As part of the quality assurance process, the products have to be analyzed using different machines.
The factory comprises:

– 8 workers
– 2 kinds of products (5 of each kind, one kind needing to be analyzed at higher priority)
– 3 machines (2 of which are able to analyze on their own, 1 needing a worker to monitor the analysis)

Worker agents and product agents must organize themselves to efficiently create an optimized production plan with few waiting times at the machines. Machines are considered locations with different amounts of space (i.e., one of the machines is able to analyze more than one product at a time) and with different durations of the production and the analysis processes (i.e., the production times depend on the kinds of products and the analysis time needed depends on the specific kind of analysis of a machine).

3.2 Results

First, the BLNs learned by the agents are inspected in this section and the learned rules (or rule chains) are extracted from the BLNs. After that, the overall performance of the agents in finding adequate solutions for the scenarios is analyzed.

Learned Rules. In both cases, the agents started without any a priori knowledge about the (conditional) probabilities of the rule fragments (as shown in Figure 1) and 100 runs were performed. The resulting BLNs are shown in Figure 2 and Figure 3.

Fig. 2. Learned BLN for School Timetabling (Scenario 1) after 100 runs.

From the BLN for Scenario 1 (Figure 2) it can be seen that the agents learned the rule FreeSpace Max with a high success probability. Since in Scenario 1 all locations are of the same size and all courses have the same duration, no further distinctions can be made regarding other rule subjects. Thus, this is the only rule that could be learned.

In case of Scenario 2 (Figure 3), it can be extracted from the BLN that the rule chain FreeSpace Max → VisitTime Avg → RemainingTime Min was learned (since P(Max|FreeSpace) > P(Avg|VisitTime) > P(Min|RemainingTime)). Thus, like in Scenario 1, it also seems to be useful here to select a target location according to its current available space. But the result is less clear than in Scenario 1: As second and third criteria, agents select their locations according to the average duration of a location (i.e., the average duration of production and analysis tasks) and according to the minimal time left until a fully covered production or analysis location becomes available again.

Fig. 3. Learned BLN for Production Simulation (Scenario 2) after 100 runs.

Performance. To analyze the performance of the learning agents, the total waiting time of all agents after solving a scenario is considered (i.e., the sum of the idle times of every agent after all tasks defined in the scenario description were completed). Therefore, 20 repetitions of 100 runs are performed for each scenario and the results are averaged over the 20 repetitions. Every repetition is divided into two phases:

1. The first 50 runs are a learning phase where the exploration probability is discounted slightly depending on the experience of the agents (as described in Section 2.3).
2. The second 50 runs are an exploitation phase, where the exploration probability is set to zero and the agents act only based on the rules learned in the first phase.

After every repetition, the probability distributions of the BLN are reset to the initial state shown in Figure 1.
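This evaluation protocol can be summarized in a short sketch; run_episode and make_initial_state are hypothetical placeholders for one simulation run and for resetting the shared epistemic state, and the within-phase discounting of the exploration probability is assumed to be handled inside run_episode.

```python
# Sketch of the evaluation protocol: 20 repetitions of 100 runs each,
# split into a learning phase (exploration) and an exploitation phase.
from typing import Callable, List

def evaluate(run_episode: Callable[..., float],
             make_initial_state: Callable[[], object],
             repetitions: int = 20, runs: int = 100,
             learning_runs: int = 50) -> List[List[float]]:
    """Returns the per-run waiting times for every repetition."""
    results: List[List[float]] = []
    for _ in range(repetitions):
        state = make_initial_state()          # reset the distributions (cf. Figure 1)
        waiting_times: List[float] = []
        for run in range(runs):
            explore = run < learning_runs     # exploration only in the first phase
            waiting_times.append(run_episode(state, explore=explore))
        results.append(waiting_times)
    return results
```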
Figure 4 and Figure 5 show the performance results for the school timetabling and the production scenarios: The curves represent the minimal waiting time of all agents after r runs (averaged over the 20 repetitions). The gray bars show the waiting time of one selected representative repetition. Note that the agents do not always find a solution for a scenario: The missing gray bars (Figure 5) indicate that the agents could not find a valid solution fulfilling all constraints of the scenario description in the respective simulation run.

Both Figure 4 and Figure 5 show that the agents are able to quickly find rather good solutions in the two scenarios. Some good solutions are already found randomly at early stages of the learning phase, but it can be seen that the overall performance improves through exploitation of the learned rules.

Fig. 4. Performance of learning agents in the school timetabling scenario (Scenario 1).

Fig. 5. Performance of learning agents in the production scenario (Scenario 2).

4 Conclusion and Future Work

In this paper, an agent model based on a BLN was presented where multiple agent instances are able to collectively learn rules (and rule chains) for cooperative solving of spatio-temporal problems. The resulting agents were evaluated in the context of two example scenarios and the results showed that the agents benefit from applying the learned rules (and rule chains).

Unlike other agent-based learning techniques (like Reinforcement Learning), the learned knowledge (i.e., the rules and rule chains) can be easily inspected and extracted from the agents (as shown in Section 3.2), and the agents can be adapted or extended by adding further rule subjects or strategies. Besides that, learning behavioral rules rather than state-action pairs reduces the state-action space significantly, which is especially useful in high-dimensional environments (as is inherently the case in multi-agent systems, since the state-action space grows exponentially with the number of agents [6]).

As future work, the rule learning approach could be tested in further, more open environments, as is the case, e.g., for robots cooperating in real-world environments (for a related real-world scenario see e.g. [5]).

References

1. Apeldoorn, D.: A spatio-temporal multiagent simulation framework for reusing agents in different kinds of scenarios. To be published in the proceedings of the MATES 2015 conference.
2. Apeldoorn, D.: AbstractSwarm – a generic graphical modeling language for multi-agent systems. In: Klusch, M., Thimm, M., Paprzycki, M. (eds.) Multiagent System Technologies. LNCS, vol. 8076, pp. 180–192. Springer, Berlin Heidelberg (2013)
3. Jain, D., Waldherr, S., Beetz, M.: Bayesian logic networks (extended version). Technical report IAS-2009-03, Technische Universität München (2009)
4. ProbCog toolbox – a toolbox for statistical relational learning and reasoning. http://ias.in.tum.de/software/probcog, last accessed on 2015-08-15
5. Scheuren, S., Stiene, S., Hertzberg, J., Hartanto, R., Reinecke, M.: The problem of spatio-temporally constrained motion planning for cooperative vehicles. In: Proceedings of the 26th Workshop "Planen, Scheduling und Konfigurieren, Entwerfen" (PuK 2011) (2011)
6. Tan, M.: Independent vs. cooperative agents. In: Proceedings of the Tenth International Conference on Machine Learning, pp. 330–337. Morgan Kaufmann, San Francisco (1993)