=Paper=
{{Paper
|id=Vol-3231/iStar22_paper_5
|storemode=property
|title=Towards Goal-based Generation of Reinforcement Learning Domain Simulations
|pdfUrl=https://ceur-ws.org/Vol-3231/iStar22_paper_5.pdf
|volume=Vol-3231
|authors=Sotirios Liaskos,Shakil M. Khan,Reza Golipour,John Mylopoulos
|dblpUrl=https://dblp.org/rec/conf/istar/Liaskos0GM22
}}
==Towards Goal-based Generation of Reinforcement Learning Domain Simulations==
Sotirios Liaskos 1, Shakil M. Khan 2, Reza Golipour 1 and John Mylopoulos 3

1 School of Information Technology, York University
2 Department of Computer Science, University of Regina
3 Department of Computer Science, University of Toronto

Abstract

Reinforcement learning (RL) is a much-studied branch of machine learning and one in which substantial progress has taken place over the past few years. In RL, intelligent agents repeatedly interact with their environment and learn from the consequences of their actions. A key to effective RL is often the presence of a simulated environment that mimics the one against which the agent is to be optimized. Such simulators allow for great numbers of training iterations in a cost-effective manner and without affecting the real environment. A systematic modeling and design process that allows efficient development of maintainable and comprehensible simulators can, hence, be beneficial for effective RL. We propose an approach for model-driven generation of RL environment simulations with discrete action spaces using goal models. The proposal utilizes earlier work on model-driven development of decision-theoretic action theories, whereby the standard iStar 2.0 notation is extended to include preconditions, stochastic effects, and reward modeling. Models in the extended notation can be translated into a formal specification for model-based reasoning based on Markov Decision Processes (MDPs). To also allow for model-free RL, we introduce a module that queries the action-theoretic, stochastic action, and reward structure aspects of the generated formal specification in order to guide episodic simulations of the modeled domain. The module is wrapped by a popular framework for building RL training and testing environments, making it accessible to popular RL agent frameworks.

Keywords: iStar (i*) modeling, goal modeling, reinforcement learning, DT-Golog, OpenAI

(iStar'22: The 15th International i* Workshop, October 17th, 2022, Hyderabad, India)

1. Introduction

Reinforcement learning (RL) is an important artificial intelligence approach whereby intelligent agents can learn to optimize their behavior through engagement with their environments [10]. Key to realizing such agents is the presence of a simulated environment, repeated interaction with which allows the agent to improve the average value it gains from such interactions. Once the agent is trained in the simulated environment, it can be transferred to the real one, in which it can quickly deliver the optimal solutions.
Figure 1: An extended goal model.

Simulations for RL describe how the simulated system changes state and delivers rewards or penalties in response to actions chosen and performed by an intelligent RL agent. Developing such simulations can benefit from a model-driven approach, whereby high-level models of the action and state space are automatically translated into actual simulation components, allowing for development efficiency and for maintainability, comprehensibility and reproducibility of the outcome. In this paper, we describe how we reuse and extend an existing approach for translating goal models into formal specifications for model-based RL reasoning, so that simulation environments for model-free/learning-based RL can also be generated. The approach is based on extensions to the original translation routine, as well as the introduction of domain-independent integration components. Through the extension, analysts can switch between model-based and model-free analysis, to, e.g., benchmark alternative model-free techniques or assess the feasibility of a model-free approach when the model at hand is too large to be analyzed with model-based techniques. Thanks to the proposed extension, the overall framework can prove useful for developing process adaptation mechanisms within socio-technical systems that use past experience to improve the ways in which business goals are achieved. We offer background in Section 2, present the extension in Section 3, and conclude in Section 4.

2. Background

2.1. Action- and decision-theoretic extensions to iStar

Our approach is based on a set of extensions proposed to the iStar 2.0 graphical notation [1] for capturing stochastic performance of tasks while allowing the modeling of action-theoretic aspects including preconditions, precedence constraints and effects [2, 3]. An example of the proposed notation can be viewed in Figure 1. The diagram, which represents a travel organization problem inspired by the example of the iStar 2.0 guide [1], includes many of the basic elements of the language: goals, tasks, qualities, AND- and OR-decompositions, as well as a role and its actor boundary. To this baseline, a set of constructs have been added to allow for the expression of action-theoretic aspects. Firstly, a set of logical domain predicates are introduced for modeling the state of the world; examples include ticketsAvailable or authorizationRejected. We borrow the belief construct from GRL [4] to place such domain predicates in the diagram. We further specialize these beliefs into effects and preconditions. While the former contain single predicates, the latter contain logical formulae thereof. Preconditions are connected to tasks through precedence links (or, respectively, their negative dual, negative precedence links), signifying that the preconditions must be satisfied before the tasks are performed (resp., that if they are satisfied the tasks cannot be performed). Such links can also be drawn between tasks (resp. goals), meaning that the target cannot be performed (resp. start to be fulfilled) unless (resp. if) the origin has been performed (resp. satisfied). These additions are already useful for describing deterministic action-theoretic aspects of goal models, allowing translations to key formalisms including Golog [5], Hierarchical Task Networks (HTNs) [6, 7] or other planners [8]. The additional aspect of interest is that tasks can be stochastic, i.e., lead to alternative effects, each with a different probability. To model this, effect groups are introduced: simple tree-like structures that represent the alternative effects of a task, each annotated with its probability, when available. Finally, the framework considers that a quality is attained (or denied) not by the mere attempt to execute a task or fulfill a goal, but by the observed success of task performance. Qualities are hence connected with effects rather than directly with tasks and goals. The links used for those connections are specializations of iStar 2.0's contribution links that we call utility links to highlight their decision-theoretic semantics. Utility links may also be annotated with a number signifying the degree to which the truth value of an effect affects the overall utility of a state. Often, however, both probability and utility structures are too complex to be represented through annotations. In such cases, effect tables and utility tables are utilized, accompanying the diagram.
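To make the constructs above concrete, the snippet below captures one task of Figure 1 as plain Python data. This is purely an illustration of the kind of information the extended notation carries (a precondition, a stochastic effect group, and utility links) and is not the paper's translation format or tool input; the task and effect names are read from the figure, while the probabilities and utility values are hypothetical placeholders.

```python
# Illustrative only: one task of the Figure 1 model expressed as plain Python data.
# Probabilities and utility values below are hypothetical placeholders.
book_ref_tickets = {
    "task": "bookRefundableTickets",
    # precondition: a logical formula over domain predicates (here a single one)
    "precondition": ["ticketsAvailable"],
    # effect group: alternative stochastic outcomes of the task with their probabilities
    "effects": [
        {"fluent": "refTicketsBooked", "prob": 0.80},
        {"fluent": "failedToBookRef",  "prob": 0.20},
    ],
    # utility links: how the truth of each effect contributes to the total reward
    "utilities": {"refTicketsBooked": 10.0, "failedToBookRef": -2.0},
}

# Sanity check: the probabilities within an effect group should sum to one.
assert abs(sum(e["prob"] for e in book_ref_tickets["effects"]) - 1.0) < 1e-9
```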
2.2. DT-Golog and Model-based reasoning

The extensions introduced above allow for automatable translation of the goal models into DT-Golog, a formalism that supports the representation of, and reasoning about, action theories with decision-theoretic components (probabilities, rewards) [9]. DT-Golog models action theories through the concept of a situation, which, roughly, represents a history of actions from a distinguished initial situation S0. State is represented through predicates called fluents. To map situations to truth values of fluents, successor state axioms are used: given an initial specification of the fluents, successor state axioms describe how their values change, or remain unchanged, when actions happen. Finally, actions may or may not be feasible in a given situation based on what is prescribed in action precondition axioms. DT-Golog distinguishes between agent actions and stochastic actions. Each action of the former type is associated with a set of alternative actions of the latter type, each with a distinct probability. Thus, upon the attempt of an agent action, one of the associated stochastic actions will actually be performed, based on the corresponding probability. DT-Golog also allows the definition of total reward as a function of situations. Goal models drawn using the extended notation can be translated into such a DT-Golog specification through the application of a set of translation rules [2]. Roughly, these rules map tasks into actions, domain predicates into fluents, the AND/OR decomposition into a corresponding logical formula of fluents, precedence and effect links into precondition and successor state axioms, and structures of utility links into reward structures in DT-Golog. Through the use of the DT-Golog interpreter, the result allows the identification of policies (roughly, situation-based action recommendations in the form of DT-Golog programs) whose adoption is understood to maximize expected reward.
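For orientation, the objective behind such policies can be written in standard decision-theoretic notation (not DT-Golog's own syntax): a policy prescribes an action for each reachable situation or state, and the policy sought is the one maximizing expected cumulative reward over the horizon of interest,

```latex
\pi^{*} \;=\; \arg\max_{\pi}\; \mathbb{E}\!\left[\; \sum_{t=0}^{T} r_{t} \;\middle|\; \pi \;\right]
```

where r_t is the reward accrued at step t and T is the planning horizon.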
2.3. Model-free reinforcement learning

DT-Golog-based reasoning constitutes model-based analysis, whereby the action probabilities are known and optimization techniques, such as DT-Golog's engine or dynamic programming, suffice for the identification of optimal courses of action; i.e., there is no real "learning" involved. Model-based techniques, however, require complete knowledge of the probability model and are computationally expensive when models are larger and more complex. To address such shortcomings, model-free RL can be considered, whereby intelligent agents develop optimal strategies simply by repeating action attempts and observing their outcomes. In its most basic form, an RL agent is given a set of actions it is capable of performing, and a set of states that the system (i.e., the agent itself and/or its environment) can be in, based on the actions previously performed. Thus, when the RL agent chooses and performs an action, this may result in a state transition and the acquisition of a positive or negative reward, all of which (the new state, the reward and the probability of each) are initially unknown. In such a context, the RL agent develops provisional policies which it repeatedly improves by trying different actions and observing the reward outcome.

In RL practice, rather than deploying an RL agent in the real target environment and allowing it to perform sub-optimal actions until it learns, it is often more sensible to use a simulation of the target environment, with which agents can be trained and tested prior to such deployment. A simulation of the problem of Figure 1, for example, would be a program that receives agent tasks as input and returns (i) the new state in which the system finds itself as a consequence of performing the task and (ii) the reward (or penalty) accrued by performing the task and reaching that state. For example, when the task "Authorization Adjudicated" is given as input, the simulation would stochastically change the state to one in which the authorization is obtained or rejected, and return the reward in either case based on the reward information in the model. By repeatedly interacting with this program, the RL agent learns which actions are optimal at which state with respect to expected reward, i.e., it learns an optimal policy; note that the suitability of the policy against the target system still depends on the accuracy of the probabilities embedded in the simulation program. Moreover, model-free RL distinguishes between continuing problems, where the obtained reward is continuously utilized for learning, and episodic problems, in which there are designated terminal states; when such a state is reached, the overall trajectory up to that point (the episode) becomes the unit of evaluation, and a new episode starts thereafter. In our context, the goal decomposition models are suggestive of an episodic structure whereby an episode ends when the root goal is fulfilled.
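As a point of reference, the sketch below shows what such episodic, model-free learning typically looks like in code: a minimal tabular Q-learning loop run against a Gym-style environment with discrete states and actions, of the kind generated in Section 3. It assumes the classic Gym reset/step signatures described in Section 3.3 (newer Gym/Gymnasium releases return slightly different tuples), and the hyperparameter values are illustrative rather than taken from the paper.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=5000, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Minimal tabular Q-learning against a Gym-style discrete environment.

    Assumes the classic Gym API: reset() -> state, step(a) -> (state, reward, done, info).
    """
    q = defaultdict(float)            # Q-values keyed by (state, action)
    n_actions = env.action_space.n

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = max(range(n_actions), key=lambda a: q[(state, a)])

            next_state, reward, done, _ = env.step(action)

            # one-step Q-learning update
            best_next = max(q[(next_state, a)] for a in range(n_actions))
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q

# e.g.: q = q_learning(GMEnv(...))  # GMEnv as in Section 3.3; constructor arguments hypothetical
```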
In the following, we sketch how we can re-use our translation of goal models to DT-Golog for model-based reasoning in order to also derive simulation environments for model-free learning. The proposed capability can help analysts in various ways. On one hand, independent of the presence of an accurate probability model, analysts can compare various RL algorithms [10] with each other and against their model-based counterparts (e.g., DT-Golog, dynamic programming) with respect to, e.g., their accuracy, training latency or sensitivity under changing parameters. On the other hand, when models are large, certain model-free RL approaches may prove preferable to computationally expensive model-based solutions.

3. Generating training environments

3.1. Overview

The overall architecture of the solution can be seen in Figure 2. Extended goal models are translated into a domain specification which, along with a DT-Golog interpreter, can be used for performing model-based decision-theoretic reasoning [2]. From an implementation standpoint, both the specification and the DT-Golog interpreter are Prolog listings. To allow for simulations, two components need to be added to these listings: (a) additional domain specification details and (b) domain-independent query clauses. The resulting augmented specification can then be accessed from external applications for guiding simulations. In our case, a Python-to-Prolog interface allows a Python simulation module (GMEnv) to query the specification for information including preconditions, effects, stochastic actions and their respective probabilities, as well as reward structures. With this capability, GMEnv is able to simulate the step-wise performance of simulated actions, as guided by an external driver, i.e., an RL agent.

Figure 2: Solution Architecture. Components developed for model-free RL are shaded.

3.2. Specification Additions

Let us explore in more detail the additions that need to accompany a DT-Golog specification to allow for executing simulations. These consist of a domain-specific and a generic part. The domain-specific part includes a reproduction of the action precondition axioms for agent actions, to complement those on stochastic actions that DT-Golog requires and that are part of the original translation rules. Conveniently, in the goal model, preconditions are indeed specified at the level of agent actions, and the translation rules apply the preconditions uniformly to all stochastic actions associated with that agent action. Thus, it is easy to expand the translation rules to also produce precondition axioms for agent actions: for every group of stochastic actions, produce a precondition axiom for the agent action associated with the stochastic ones. The generic part of the specification addendum consists of a number of helper routines (Prolog rules) that allow querying of the domain specification. Letting L be a list of agent actions that have been performed, from the first to the last, and a an agent action, the most important of the added predicates are: (a) queryState(L), for translating an action history L, i.e., a situation, into a state, i.e., an array of fluent truth values; (b) isFeasible(a, L), to check if action a is feasible after action history L; (c) queryReward(L), to retrieve the total reward of L; and (d) queryDone(L), for checking if the root goal has been fulfilled after L, by checking the corresponding logical formula constructed over the fluents that represent success-signifying effects of leaf-level tasks. Key to implementing these rules is the realization that, again, the RL concept of state is different from that of Golog's situation: the former represents a configuration of values of the fluents, i.e., the variables that represent state, while the latter is a history of agent actions. As we saw, given a situation, a state is retrievable by collecting the fluents that, according to the successor state axioms, hold in that situation. State is encoded as a binary array, each element of which represents a fluent, the binary value signifying whether the fluent holds or not. Such an array is easily translatable to an integer representation, as required by the client environment. Similar indexing is introduced for agent actions.
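For illustration, the snippet below shows one way the addendum's query predicates could be invoked from Python through pyswip, the Python-to-Prolog bridge named in Figure 2, together with the binary-array-to-integer state encoding just described. The predicate names come from the list above, but the file name, the action names, the use of an extra output variable for queryState and queryReward, and the assumption that the fluent vector is returned as a list of 0/1 integers are illustrative assumptions, not the paper's actual conventions.

```python
from pyswip import Prolog

prolog = Prolog()
prolog.consult("domain_spec.pl")      # translated spec plus addendum; file name is illustrative

history = ["bookRefTickets"]          # action history L, first to last (hypothetical action names)
hist = "[" + ",".join(history) + "]"

# (b) isFeasible(a, L): the query succeeds (yields at least one solution) iff a is feasible after L.
feasible = bool(list(prolog.query(f"isFeasible(submitAuthApplication, {hist})")))

# (a), (c): here we assume an extra output variable is exposed for the returned value.
fluents = list(prolog.query(f"queryState({hist}, Fluents)"))[0]["Fluents"]
reward = list(prolog.query(f"queryReward({hist}, R)"))[0]["R"]

# (d) queryDone(L): has the root goal been fulfilled after L?
done = bool(list(prolog.query(f"queryDone({hist})")))

# Encode the binary fluent array as a single integer, assuming a fixed fluent ordering
# and that Fluents is bound to a list of 0/1 values.
state_id = sum(int(v) << i for i, v in enumerate(fluents))
```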
3.3. The GMEnv Component

The querying routines added to the specification can be utilized by external simulation-implementing tools. In our implementation, the client, GMEnv, is a Python class which implements OpenAI Gym's [11] Env interface, which is widely used for developing simulation environments and is required by popular RL agent development frameworks. A GMEnv object maintains information about the development of an episode through a list of the agent and stochastic actions that have been performed. The Env interface requires the implementation of three important methods: (a) an initialization routine specifying, among other things, the type of the input and output spaces; (b) a reset routine, whereby the state is typically brought back to its initial value; and (c) a step routine, which accepts an action a as a parameter and returns the state that results from executing the action, the reward accrued from performing it, as well as a boolean value representing whether the system has reached a terminal state. Our implementation of step utilizes the routines mentioned in the previous section in order to calculate the feasibility, probability profiles and rewards of proposed actions, as well as whether they lead to root goal fulfillment, which marks the end of an episode. The implementation of reset trivially returns GMEnv to the initial state in which no action has been performed.
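A minimal sketch of what an Env implementation along these lines could look like is given below. It assumes the classic Gym reset/step signatures and delegates all domain knowledge to a domain object that stands in for the pyswip-backed query layer of Section 3.2; the method names on that object (feasible, outcomes, state_id, reward, done) and the incremental-reward calculation are illustrative assumptions, not the authors' actual GMEnv code.

```python
import random

import gym
from gym import spaces

class GoalModelEnv(gym.Env):
    """Sketch of a GMEnv-like environment (classic Gym API assumed)."""

    def __init__(self, domain):
        # `domain` wraps the Prolog queries (queryState, isFeasible, queryReward, queryDone)
        # behind hypothetical Python methods.
        self.domain = domain
        self.action_space = spaces.Discrete(domain.n_actions)
        self.observation_space = spaces.Discrete(2 ** domain.n_fluents)
        self.history = []                 # agent/stochastic actions performed so far

    def reset(self):
        self.history = []                 # back to the initial situation: no actions performed
        return self.domain.state_id(self.history)

    def step(self, action):
        reward_before = self.domain.reward(self.history)
        if self.domain.feasible(action, self.history):
            # Sample one of the stochastic actions associated with the agent action,
            # according to the probabilities recorded in the specification.
            outcomes, probs = zip(*self.domain.outcomes(action, self.history))
            self.history.append(random.choices(outcomes, weights=probs, k=1)[0])
        obs = self.domain.state_id(self.history)
        reward = self.domain.reward(self.history) - reward_before   # incremental reward (assumption)
        done = self.domain.done(self.history)                       # root goal fulfilled ends the episode
        return obs, reward, done, {}
```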
4. Concluding Remarks and Future Work

We described an extension of a framework for performing decision-theoretic reasoning with goal models with additional components that allow the automatable generation of discrete-action simulation environments for RL. Through the extension, goal models can be used as the basis for the model-driven engineering both of reasoners for model-based identification of optimal policies and of simulations that allow model-free learning of optimal policies. The specific extension for deriving simulations for model-free learning can be utilized in different ways, including benchmarking model-free RL algorithms against a known model, or testing model-free solutions when models are too large for model-based reasoning. While the overall framework has been motivated by design-time identification of optimal task sequences under uncertainty within socio-technical analysis contexts, the proposed extension also paves the way for model-driven design and prototyping of run-time process/workflow management components that learn from experience. Of great priority for future investigation is whether there are RL techniques that are indeed more effective and efficient than DT-Golog for any size or type of goal model, and, furthermore, whether there are synergies between the two, e.g., training against the simulation to assist parallel search in the state space. Furthermore, we are exploring ways to model continuous state spaces and episodic structures that contain more than one root goal fulfillment instance. Should that be possible, the modeling framework could prove useful for iStar-driven development of RL agents for physical or cyber-physical systems.

References

[1] F. Dalpiaz, X. Franch, J. Horkoff, iStar 2.0 Language Guide, The Computing Research Repository (CoRR) abs/1605.07767 (2016). URL: http://arxiv.org/abs/1605.07767. arXiv:1605.07767.
[2] S. Liaskos, S. M. Khan, J. Mylopoulos, Modeling and reasoning about uncertainty in goal models: a decision-theoretic approach, Software and Systems Modeling (2022). doi:10.1007/s10270-021-00968-w.
[3] S. Liaskos, S. M. Khan, M. Soutchanski, J. Mylopoulos, Modeling and Reasoning with Decision-Theoretic Goals, in: Proceedings of the 32nd International Conference on Conceptual Modeling (ER'13), Hong Kong, China, 2013, pp. 19–32.
[4] E. S. Yu, GRL - Goal-oriented Requirement Language, 2001. URL: https://www.cs.toronto.edu/km/GRL/.
[5] X. Wang, Y. Lespérance, Agent-oriented requirements engineering using ConGolog and i*, in: Bi-Conference Workshop at Agents 2001 and CAiSE'01 (AOIS-2001), 2001.
[6] S. Liaskos, S. McIlraith, S. Sohrabi, J. Mylopoulos, Representing and reasoning about preferences in requirements engineering, Requirements Engineering Journal (REJ) 16 (2011) 227–249.
[7] S. Liaskos, S. A. McIlraith, S. Sohrabi, J. Mylopoulos, Integrating Preferences into Goal Models for Requirements Engineering, in: Proceedings of the 10th IEEE International Requirements Engineering Conference (RE'10), Sydney, Australia, 2010.
[8] S. Liaskos, S. M. Khan, M. Litoiu, M. D. Jungblut, V. Rogozhkin, J. Mylopoulos, Behavioral adaptation of information systems through goal models, Information Systems (IS) 37 (2012) 767–783.
[9] M. Soutchanski, High-Level Robot Programming in Dynamic and Incompletely Known Environments, Ph.D. thesis, Department of Computer Science, University of Toronto, 2003.
[10] R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, The MIT Press, 2018.
[11] OpenAI Gym, 2022. URL: https://github.com/openai/gym.