=Paper=
{{Paper
|id=Vol-2741/paper-06
|storemode=property
|title=Reinforcement Learning-driven Information Seeking: A Quantum Probabilistic Approach
|pdfUrl=https://ceur-ws.org/Vol-2741/paper-06.pdf
|volume=Vol-2741
|authors=Amit Kumar Jaiswal,Haiming Liu,Ingo Frommholz
|dblpUrl=https://dblp.org/rec/conf/sigir/Jaiswal0F20
}}
==Reinforcement Learning-driven Information Seeking: A Quantum Probabilistic Approach==
Reinforcement Learning-driven Information Seeking: A Quantum Probabilistic Approach

Amit Kumar Jaiswal[0000−0001−8848−7041], Haiming Liu[0000−0002−0390−3657], and Ingo Frommholz[0000−0002−5622−5132]
University of Bedfordshire, Luton, United Kingdom
{amitkumar.jaiswal,haiming.liu,ingo.frommholz}@beds.ac.uk

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). BIRDS 2020, 30 July 2020, Xi’an, China (online).

Abstract. Understanding an information forager’s actions during interaction is very important for the study of interactive information retrieval. Information spread in an uncertain information space is substantially complex due to the high entanglement of users interacting with information objects (text, image, etc.). Moreover, an information forager generally carries a piece of information (an information diet) while searching (or foraging) for alternative content, typically subject to decisive uncertainty. Such uncertainty is analogous to measurement in quantum mechanics, which follows the uncertainty principle. In this paper, we discuss information seeking as a reinforcement learning task. We then present a reinforcement learning-based framework that models the forager’s exploration, treating the information forager as an agent in order to guide their behaviour. Our framework also incorporates the inherent uncertainty of the forager’s actions using the mathematical formalism of quantum mechanics.

Keywords: Information Seeking · Reinforcement Learning · Information Foraging · Quantum Probabilities.

1 Introduction

Web searchers generally move from one webpage to another by following links or cues, carrying the consumed information (intake) with them without attaining a generalised appetite (information diet), in uncertain and dynamic information environments [1, 2]. The evolution of information patterns arising from user interaction keeps searchers in an information seeking process in which they do not consume an optimised information diet (the information goal). There therefore needs to be a mechanism that can guide foragers during their search process in order to set a realistic information appetite. User interaction is an important part of the search process that can enhance search performance as well as the information foragers’ search experience and satisfaction [3, 5, 15]. User actions and their dynamics during search play an important role in changing behaviour and user belief states [12]. It has recently been demonstrated that action behaviour representations can be learned using reinforcement learning (RL) [11] by decomposing a policy into two components: an action representation and its transformation. To capture the information forager’s (or searcher’s/user’s) [6] cognitive ability during search, we treat the searcher as an RL agent that follows Information Foraging Theory (IFT) [4], in order to understand how users can learn in an ongoing process of finding information. Furthermore, the learning ability of users can be signalled by the RL approach through a free choice of search scenarios in an uncertain environment. For instance, the information seeker must optimise the trade-off between exploration, by sustained steps in the search space, on the one hand and exploitation of the resources encountered on the other hand.
We believe that this trade-off characterises how a user deals with uncertainty and its two aspects, risk and ambiguity, during the search process [6]. The pattern of behaviour in IFT is therefore mostly sequential. Risk and ambiguity minimisation cannot happen simultaneously, which places an underlying limit on how good such a trade-off can be. This lets the information foraging perspective of information seeking converge with the developing field of quantum theory [6]. Moreover, web search engines enable their users to efficiently access a large amount of information on the Web, which in turn leads search users to learn new knowledge and skills during their search processes. When users search to obtain knowledge, their information needs¹ are mostly varied, open-ended, and seldom clear at the start. Such search sessions generally span multiple queries and involve rich interactions; our aim is therefore to model this kind of information foraging process, in which the users’ cognitive state changes during search.

Due to its inherently complex and intensely interactive nature, effective interactive information foraging is demanding for both users and search systems. Hence, our focus is to incorporate contextual semantic information into the modelling of the information forager using the mathematical framework of quantum theory, i.e. geometry-based quantum probabilities. Specifically, we propose a quantum-inspired reinforcement learning approach that (a) models the information forager’s behaviour, where action selection (the policy) is realised by an Actor-critic method [22] to enhance the agent’s experience in a text query-matching task; and (b) learns a policy in which the query representation is parameterised using quantum language models, with a focus on the interaction across multi-meaning words.

¹ We consider an information need (IN) to be expressed by a query or a series of queries.

2 Related Work

This section covers aspects of reinforcement learning, quantum theory in dynamic information retrieval (IR), in particular interactive information retrieval [8, 7], and Information Foraging Theory.
Reinforcement Learning in Information Retrieval: Learning from the consequences of one’s actions is a common mechanism of learning and interaction in humans and other animals, and it is generally called reinforcement learning. Reinforcement learning (RL) techniques [10] are motivated by human decision making, which appears to be biologically rooted. Within such biological roots [9], when an information forager’s action ends with a disadvantageous consequence (a negative payoff), that action will not be repeated in the future, whereas if the action leads to a successful consequence (a positive reward), it will happen again. User involvement in information searching is primarily a decision-making (or action-taking) process [20], and users exhibit these RL features during this process. We adopt RL models to capture the mechanisms governing users’ learning of information from searching. Previous work [21] found that a search system’s information can be enriched to advance search intention and to automate difficult query reformulation by modelling the search context. Reinforcement learning is an important method that lets a system employ the search context and relevance feedback simultaneously. This approach also allows the system to deal with exploration (widening the search among different topics) and exploitation (moving deeper into specific subtopics), which has been supportive in information retrieval [23, 24]. Exploration and exploitation methods are usually employed in tasks associated with recommender systems or information retrieval, such as foraging strategies [25], recommendation [26] or image retrieval [27]. However, reinforcement learning is mainly used by search/retrieval systems [31] that collect users’ interests and habits over a continuous period, whereas in a specific search scenario the users in a given session are more interested in the holistic improvement of their search results than in arbitrary future search sessions.

Quantum Theory and Information Retrieval: Quantum Theory (QT) has matured as a means to reinforce search potential by applying the mathematical formalism of quantum mechanics to information retrieval [28]. The aim of introducing the QT formalism was to elucidate the implausible behaviour of micro-level search actions, which classical probability theory may not be able to model. Furthermore, it is an expressive formalism that can combine prominent probabilistic, geometric and logic-based IR approaches. The mathematical foundation of the Hilbert space formalism was introduced in [29], enabling this mathematical framework to be applied outside of physics. In classical probability theory we refer to events as subsets of a sample space of all potential events, whereas in quantum theory the probability space is defined geometrically and represented by an abstract vector space of angles and distances, more precisely a finite- or infinite-dimensional Hilbert space denoted by H. Every event is depicted as a subspace of the Hilbert space. To represent the n-dimensional vectors that compose a Hilbert space, the Dirac notation with its ket and bra nomenclature is widely adopted. Concretely, a given vector ψ is represented as |ψ⟩ and its transpose ψ^T as ⟨ψ|. The vectors under consideration in a Hilbert space are usually unit vectors (their length is 1). A projection onto the subspace induced by a vector |ψ⟩ is denoted by the operation |ψ⟩⟨ψ|, which results in a matrix. Within such a subspace the contained vectors are again normalised², and the projection of events represented as vectors is again performed by the |ψ⟩⟨ψ| operation. Unit vectors interpreted as state vectors induce a probability distribution over events (subspaces), and the product resulting from the mentioned operation is called a density matrix. So-called observables are used to perform a measurement of the outcomes (which are eigenvalues). The major similarity between quantum mechanics (QM) and information retrieval (IR) is the interaction between a user (the observer in QM) and the information object under observation [28]. The core connection between QM and IR stems from their probabilistic features, where conditional probabilities allied with interference effects lead to a contextual measure (of cognitive, subjective character) when combining varied objects³. In QT, we can represent user information needs with state vectors, queries as observables with associated eigenvalues, and the probability of obtaining single eigenvalues or objects as a measure of the degree of relevance to a query [8]. QM has also previously been incorporated into an RL algorithmic approach to generalise the filtering of favourable user actions [30].

² There may be some vectors which are not necessarily normalised.
³ https://www.newscientist.com/article/mg21128285-900-quantum-minds-why-we-think-like-quarks/
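To make the Dirac-notation machinery above concrete, here is a minimal numerical sketch (not taken from the paper; the two-dimensional space, the state |ψ⟩ and the event vector |φ⟩ are invented for illustration) showing a rank-one projector |φ⟩⟨φ|, a density matrix |ψ⟩⟨ψ|, the resulting event probability, and a sampled "collapse":

```python
import numpy as np

# |psi>: a unit state vector standing in for a user's information need state (invented).
psi = np.array([0.6, 0.8])            # already normalised: 0.36 + 0.64 = 1

# |phi>: a unit vector spanning the subspace of an "event" (e.g. relevance to a query).
phi = np.array([1.0, 0.0])

# Projector |phi><phi| onto that one-dimensional subspace.
P = np.outer(phi, phi)

# Density matrix of the pure state, |psi><psi|.
rho = np.outer(psi, psi)

# Probability of the event: <psi| P |psi> = Tr(P rho).
p_event = np.trace(P @ rho)
print(f"P(event) = {p_event:.2f}")    # 0.36 in this toy example

# Sampling the "collapse": the event occurs with probability p_event.
outcome = np.random.rand() < p_event
print("collapsed into the event subspace" if outcome else "collapsed into the complement")
```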
Information Foraging Theory: Information Foraging Theory (IFT) [4] was developed to understand human cognition and behaviour. IFT provides constructs adapted from optimal foraging theory, with predators corresponding to humans who seek information (the prey). It has three constructs. The first delineates searches (or search engine result pages, SERPs) in sections of the user interface, referred to as information patches. Information scent describes how users make use of perceptual cues, such as web links with small snippets of graphics and text, to make navigation decisions when selecting a specific link; the purpose of such cues is to characterise the content that will be encountered by following the links. Finally, the information diet allows users to narrow or expand the diversity of information sources based on their profitability (appetite). Information foraging is an active area of IR and information seeking due to its sound theoretical basis for explaining characteristics of user behaviour. IFT has been applied to model users’ information needs and their actions using information scent [13]. It has also been found that information scent can be used to analyse and predict the usability of a website by determining the website’s scent [14]. Liu et al. [15] demonstrated an IFT-inspired user classification model for a content-based image retrieval system to understand users’ search preferences and behaviours, applying the model to a wide range of interaction features collected from screen captures of different users’ search processes. Recent work [16, 17] studied the effects of foraging in personalised recommendation systems by inspecting the visual attention mechanism to understand how users follow recommended items. Such user-item interactions can also be seen at the query level, i.e. in query reformulation scenarios, where IFT- and RL-like models [18, 19] provide better explainability.

3 Information Seeking as a Reinforcement Learning Task

During the search process, a searcher has to investigate several actions, each with unknown reward, before selecting one of them. They explore the results back and forth to estimate the optimal patch based on the reward. This scenario of information seeking can be interpreted as a reinforcement learning task in which the search process, involving an agent interacting with the search environment, is cost-driven. The agent’s assessment of positively rewarded actions (derived from the searcher’s incurred costs) within an uncertain environment can potentially optimise the forager’s choice in finding information. From an IFT perspective, positively rewarded actions can be viewed as exploitation and the remaining available actions as exploration, provided the information is scattered across a patchy environment. The fundamental aspect of reinforcement learning is to “learn by doing with delayed reward”, which is a major connection to information seeking (especially user interaction in IR and recommendation tasks) and also describes the foraging process of a searcher. The seeker’s goal is to quickly locate a relevant patch (document, image, etc.). However, the information seeker has no prior knowledge of the rewards of the assessed patches and keeps exploring each of them. By interacting with the search system, the seeker explores which results contain relevant information; this elicits the reward distribution (the information scent patterns) across information patches, and access to patches with minimal reward can often point towards an optimal patch on which the seeker has spent too little time to exploit it. A seeker who spends little time assessing each information patch obtains only partially relevant information about the reward distribution between patches, which leads to exploiting a patch with less than the optimal reward. Conversely, the longer a seeker explores, the more accurate their information about all of the patches becomes, but they give up the chance to exploit the most relevant patch for longer. Understanding these operationalised scenarios paves the way to modelling foraging behaviour in which the underlying causes on the user side can be uncertainty, information overload, and confusion.
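The exploration/exploitation trade-off described in this section can be illustrated, purely as a toy analogy rather than as part of the proposed framework, by treating information patches as arms of a bandit with unknown reward rates and letting an ε-greedy seeker balance assessing new patches against revisiting the best patch found so far; the patch reward rates and the value of ε below are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical reward probabilities of three information patches (unknown to the seeker).
true_patch_reward = np.array([0.2, 0.5, 0.8])
n_patches = len(true_patch_reward)

estimates = np.zeros(n_patches)   # seeker's running estimate of each patch's value
visits = np.zeros(n_patches)      # how often each patch has been assessed
epsilon = 0.1                     # fraction of steps spent exploring

for step in range(1000):
    if rng.random() < epsilon:                # explore: assess a random patch
        patch = int(rng.integers(n_patches))
    else:                                     # exploit: revisit the best-looking patch
        patch = int(np.argmax(estimates))
    reward = float(rng.random() < true_patch_reward[patch])   # delayed, binary relevance signal
    visits[patch] += 1
    estimates[patch] += (reward - estimates[patch]) / visits[patch]   # incremental mean

print("estimated patch values:", np.round(estimates, 2))
print("most exploited patch:", int(np.argmax(visits)))
```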
4 Quantum-inspired Reinforcement Learning Framework

We outline the proposed reinforcement learning approach to model the forager’s actions during an information seeking scenario in which the task is to match a query to a given document and the forager’s actions are queries. An agent interacts with its search environment, characterised by a patchy distribution of information, to find an optimal foraging strategy that maximises its reward. The forager’s environment provides a fixed setting of optional information sources. Moreover, the forager has the choice to add a distinctive type of information patch into their diet. However, the distribution of distinctive information patches may contain information which the forager is likely not to consume, due to counterfactual situations when deciding which patch (say, document D1 or D2) contains certain information. In our framework, we consider the environment to be uncertain, with dynamic parameters throughout a forager’s search trail. The forager finds it difficult to differentiate patches and exploits experience to learn the environment. This increasing learning load makes the task complex at the dynamic and cognitive level, where the forager’s pursuit is to locate the most relevant documents.

We use the Actor-critic policy gradient method [22], which inherently models such dynamics because the forager’s sequential behaviours generate a continuous state representation. A forager’s action (or state) can be described by a quantum superposition state, and the corresponding updated state vectors, based on the respective interaction, can be obtained by random observation of the simulated quantum state according to the collapse principle of quantum measurement [28]. The probability of such an action state vector is given by its probability amplitude, which is updated in parallel based on the reward. This gives rise to new internal aspects compared to traditional RL algorithms, namely policy, representation, action (in parallel) and operation updates.

The quantum measurement decision process of a forager selecting a document (the action) while seeking is ambiguous and uncertain [6]. In such a situation, an observable describes the possible actions (documents or information patches to select) and can be represented as Ô with a basis set containing |0⟩ and |1⟩, which correspond to the two state vectors of Ô.
Measuring a quantum system with the observable Ô in a corresponding superposed quantum state |ψ⟩ is referred to as a measurement in a superposition state. When making a measurement in state |ψ⟩, the quantum state collapses into one of its basis states, |0⟩ or |1⟩. However, one cannot know a priori with certainty into which of these states it will collapse. The only information the quantum system provides is that |0⟩ will be measured with probability |α|² and |1⟩ with probability |β|², where α and β are the respective probability amplitudes.

We present a quantum-inspired reinforcement learning (qRL) framework for information seeking under dynamic search scenarios. The schematic architecture of qRL is shown in Fig. 1. qRL has two main components: an Actor-critic [22] based network representing the RL agent, which jointly encodes the state and action spaces, and the information space, known as the environment, containing documents. The constructs of the Actor-critic components of the RL agent are expressed via the Hilbert space formalism of quantum theory [28].

Our framework is applicable to matching tasks, in particular semantic query matching, where candidate queries (queries extracted or predicted from the document) are matched against the original document in a semantic Hilbert space (SHS) [33]. An SHS is a vector space of words, where word combinations involve a linear/non-linear composition of amplitudes and phases, delineating various levels of semantics of the combined words. In the SHS, a word w_i is represented by a base vector |w_i⟩. The semantics of combined words are represented by superpositions of word vectors, encoded in the probability amplitudes of the corresponding base vectors.

Fig. 1. Our proposed framework: a quantum-inspired reinforcement learning-driven model for an information seeker (in a semantic query matching task).

4.1 Preliminaries

Standard reinforcement learning is based on a finite-state, discrete-time Markov decision process (MDP) composed of five components: s_t, a_t, p_ij(a), r_{i,a} and C, where s_t is the state at time t and a_t the action taken at that time in the given state; p_ij(a_t) is the state transition probability (from state s_t to s_{t+1} via action a_t for all t ∈ (i, j)); r is a reward function r : Γ → R with Γ = {(i, a) | i ∈ s_t, a ∈ a_t}; and C is an objective function.

In the following discussion we utilise tensor spaces. The notation in Table 1 follows [35, 34]. The fabric of our framework, i.e. the underlying Hilbert space H, is similar to the Tensor Space Language Model described in [34]. Here, the base vectors {|φ_i⟩}_{i=1}^n of our n-dimensional space⁴ are term vectors, either one-hot vectors or word embeddings. Any word vector |w⟩ can be written as a linear combination of the base vectors, i.e. |w⟩ = Σ_{i=1}^n α_i |φ_i⟩ with coefficients α_i ∈ R (or C in the complex case).

⁴ The Hilbert space can be over the real or complex field, i.e. R^n or C^n; we assume R^n for the further discussion.

Table 1. Notation used in the reinforcement learning constructs, following [35, 34]. b_i indexes the orthonormal basis of the Hilbert space, ⊗ denotes the tensor product, R is the rank of G, and L is an n-order tensor of rank 1.

Notation | Interpretation | Description
α_{i,b_i} | b_i ∈ {1, ..., k} | Probability amplitude
|φ_{b_i}⟩ | Semantic meaning | Basis vector (word vectors; tensor products of basis vectors are k^n-dimensional)
|w_i⟩ | Σ_{b_i=1}^{k} α_{i,b_i} |φ_{b_i}⟩ | Word state vector
|q_i⟩ | |w_1⟩ ⊗ |w_2⟩ ⊗ ... ⊗ |w_n⟩ | Query state vector
ψ_q^T | Σ_{b_1,...,b_n=1}^{k} (Π_{i=1}^{n} α_{i,b_i}) |φ_{b_1}⟩ ⊗ ... ⊗ |φ_{b_n}⟩ | Local representation (L_{b_1,...,b_n} = Π_{i=1}^{n} α_{i,b_i} is a k^n-dimensional tensor)
|ψ_q⟩ | Σ_{b_1,...,b_n=1}^{k} G_{b_1 b_2 ... b_n} |φ_{b_1}⟩ ⊗ ... ⊗ |φ_{b_n}⟩ | Global representation of combined meanings/patches
G | Σ_{r=1}^{R} w_r · e_{r,1} ⊗ e_{r,2} ⊗ ... ⊗ e_{r,n} | Probability amplitude (semantic space of meaning)
ψ_q^T ψ_q | Σ_{b_1,...,b_n=1}^{k} G_{b_1...b_n} × Π_{i=1}^{n} α_{i,b_i} | Projection of the global representation onto the local representation of a query (probability amplitudes)
State | Π_{i=1}^{n} Σ_{b_i=1}^{k} e_{r,i,b_i} · α_{i,b_i} | Actor network state module (product pooling layer [35])
|a_t⟩ | (|a_1⟩, |a_2⟩, ..., |a_R⟩)^T | Output of the Actor network
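As one concrete reading of the notation in Table 1 (a sketch with invented dimensions and amplitudes, assuming a real-valued Hilbert space as in the footnote above), word state vectors are linear combinations of k basis vectors and a query state vector is their tensor product, i.e. the rank-1 local representation:

```python
import numpy as np

k = 3                                        # dimension of the word basis {|phi_b>} (invented)
basis = np.eye(k)                            # one-hot basis vectors |phi_1>, ..., |phi_k>

def word_state(amplitudes):
    """|w_i> = sum_b alpha_{i,b} |phi_b>, normalised to unit length."""
    alpha = np.asarray(amplitudes, dtype=float)
    alpha = alpha / np.linalg.norm(alpha)
    return sum(a * basis[b] for b, a in enumerate(alpha))

# Two hypothetical words of a query.
w1 = word_state([0.9, 0.1, 0.0])
w2 = word_state([0.2, 0.3, 0.5])

# Query state vector |q> = |w1> (x) |w2>: a k^n-dimensional rank-1 (local) representation.
q_local = np.kron(w1, w2)
print(q_local.shape)                         # (9,) for k = 3, n = 2

# A global representation |psi_q> lives in the same k^n space, but its amplitudes
# G_{b1...bn} would be trained on a query collection; a random unit tensor stands in here.
G = np.random.default_rng(1).normal(size=k * k)
psi_q = G / np.linalg.norm(G)

# Projection of the global onto the local representation, psi_q^T psi_q in the notation above.
print(float(psi_q @ q_local))
```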
The overview of the RL process, which follows the Markov decision process, is as follows:

Agent: In general, an agent acts as a controller within an information environment and is the one executing actions. In our framework, the RL agent is a forager (information seeker) exposed to the search environment (documents) who delivers queries as actions, where each action is chosen by the Actor-critic network (the policy network).

Action: An action a_t in the Web search scenario corresponds to a query that searchers use to express their information need, with the aim of either retrieving a document as an outcome of the query or continuing the search process (exploratory search); a formal representation of the user action is shown in Table 1. In our framework, the forager’s (or searcher’s) action is to match a candidate query |q⟩ (generated after inputting a set of queries) to document D so as to obtain |q_{rD}⟩, where |q_{rD}⟩ is a query state vector representing the optimal query for the selected document D given a positive/optimal reward r. A candidate query is an outcome generated by the Actor network given the forager’s set of input queries.

State: A state s_t delineates the positive historical interaction of the forager with the search environment. In our framework, the Actor network has its state encoded by the product, over all words of a query, of the probability amplitudes of the global-local projection ψ_q^T ψ_q (of word meanings). We refer to this as the state representation defined by the product pooling method.

State Transition: The state representation describes the positive historical interaction of a forager. The transition between states can be computed from the user’s feedback. Our framework uses a convolutional neural network whose convolution is based on a state vector that encodes the historical interaction of the forager in finding the match for a query.

Policy: The policy is a strategic mechanism which represents the probability of a forager’s action under a certain state. Our framework’s policy network is stochastic, and we employ the Actor-critic RL method [22] (Fig. 1), which supports the forager’s actions in the Actor network with an optimal policy value generated by the Critic. Thus, the Actor network estimates the probability of a forager action, and the Critic network obtains the optimal value and updates it. The policy network is modelled as a probability distribution over actions and is hence stochastic.

Reward: The reward r(s, a) in reinforcement learning is the success value of an agent’s action a. In information retrieval, this success value is interpreted in terms of the relevance judgement score [7]. In our framework, the Critic network, which takes a (state, action) pair as input, produces the reward value and provides it to the Actor network as the optimal reward for the given action; it thereby judges and scores the actions of the agent (or forager).
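A minimal sketch of how the constructs above could be wired together as a conventional agent-environment loop; the toy environment, the candidate queries and the matching rule below are invented placeholders rather than the paper's implementation:

```python
import random

class QueryMatchEnv:
    """Toy environment: the 'document' prefers one target query; actions are candidate queries."""
    def __init__(self, candidate_queries, target):
        self.candidates = candidate_queries
        self.target = target
        self.history = []                       # positive interactions so far (the 'state')

    def step(self, action):
        # Reward in {-1, 0, +1}: mismatch / partial match / match, mirroring Section 4.2.
        if action == self.target:
            reward = 1
        elif set(action.split()) & set(self.target.split()):   # invented partial-match rule
            reward = 0
        else:
            reward = -1
        if reward >= 0:
            self.history.append(action)         # state transition driven by positive feedback
        return tuple(self.history), reward

env = QueryMatchEnv(["quantum ir", "foraging theory", "quantum retrieval"],
                    target="quantum retrieval")
state, reward = env.step(random.choice(env.candidates))
print(state, reward)
```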
4.2 Our Proposed Framework

This fundamental RL definition is essential for proposing the quantum-inspired reinforcement learning constructs. Following the quantum probability concepts, the constructs are as follows (please also refer to Fig. 1):

Actor Network: The Actor-critic method [22] is a policy gradient mechanism in which the Actor network, for a given forager (or information seeker) in a particular state |s_t⟩, outputs an action |a_t⟩. The network takes user queries (the forager’s actions) as input, where these queries |q_1⟩, |q_2⟩, ..., |q_n⟩, or a set of textual descriptions (which collectively form a document), form the local and global representations so as to model the interrelated interaction between words. Inspired by quantum theory, we employ the interpretation of the wave function |ψ⟩ (due to the importance of word positions [36]) as a state vector that can be explicated in RL constructs. The Actor network inputs query state vectors |q_1⟩, ..., |q_n⟩, where each query is treated as a tensor product of word vectors |w_i⟩ and every word has a unique basis vector |φ_{b_i}⟩ that provides a generic semantic meaning with an associated probability amplitude. The speciality of a basis vector is that it can lead to a different meaning if interpreted differently across contexts. We then apply our framework to a semantic query matching task using a real-valued representation of queries in terms of local and global distributions, allowing such intermittent basis vectors to capture the interaction between the meanings of different words. Hence, the wave function description of a query |q_i⟩ can be depicted using the tensor product of words as ψ_q^T = |w_1⟩ ⊗ |w_2⟩ ⊗ ... ⊗ |w_n⟩. Word dependency can be seen by expanding the tensors as ψ_q^T = Σ_{b_1,...,b_n=1}^{k} L_{b_1...b_n} |φ_{b_1}⟩ ⊗ ... ⊗ |φ_{b_n}⟩, where L (whose value is shown in Table 1) denotes the associated probability amplitudes of the k^n-dimensional tensor, with the respective basis vectors |φ_{b_1}⟩, ..., |φ_{b_n}⟩ representing the meaning of the corresponding query. This tensor-based query representation is a local representation, as a tensor of rank 1 delineates the local distribution of a query [34]. For words unseen in a query, or for compound meanings, we need a global representation provided by a collective set of basis states (or vectors). A state vector (i.e. the wave function of a query) describing such a global representation is |ψ_q⟩ = Σ_{b_1,...,b_n=1}^{k} G_{b_1...b_n} |φ_{b_1}⟩ ⊗ ... ⊗ |φ_{b_n}⟩. This wave function delineates a semantic embedding space of the n uncertain word meanings of a given query. The local and global representations differ in terms of their corresponding probability amplitudes, i.e. L and G: the probability amplitudes of the global distribution are trained on a large collection of previous queries, whereas the probability amplitudes of the local distribution relate only to the input query. To relate the probability amplitudes of the words in the input query (local representation) to the unseen words generated from the global representation, we compute the inner product ψ_q^T ψ_q of the two representations, which disentangles the interaction between them. The value of the projection is shown in Table 1.
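Because the local representation is a rank-1 tensor and G (in the form given in Table 1) is a rank-R sum of outer products, the projection ψ_q^T ψ_q factorises over words and never requires materialising the k^n-dimensional tensor. The following sketch, with invented dimensions and random factors, checks this factorised computation against the explicit contraction:

```python
import numpy as np

rng = np.random.default_rng(2)
k, n, R = 4, 3, 2                      # basis size, query length, tensor rank (all invented)

# Local representation: per-word amplitude vectors alpha_i (rank-1 tensor, never materialised).
alphas = [rng.normal(size=k) for _ in range(n)]

# Global representation in CP form: G = sum_r w_r * e_{r,1} (x) ... (x) e_{r,n}.
weights = rng.normal(size=R)
factors = rng.normal(size=(R, n, k))   # e_{r,i}

# Projection <G, alpha_1 (x) ... (x) alpha_n> factorises over words ("product pooling"):
proj_factored = sum(weights[r] * np.prod([factors[r, i] @ alphas[i] for i in range(n)])
                    for r in range(R))

# Check against the explicit k^n tensor contraction (n = 3 here).
G_full = sum(weights[r] * np.kron(np.kron(factors[r, 0], factors[r, 1]), factors[r, 2])
             for r in range(R))
local_full = np.kron(np.kron(alphas[0], alphas[1]), alphas[2])
assert np.isclose(proj_factored, G_full @ local_full)
print(proj_factored)
```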
We use a convolutional neural network (CNN) to learn the obtained higher-dimensional tensor G (its value is shown in Table 1), where tensor rank decomposition (among other methods such as generalised singular value decomposition) can be used to decompose it into rank-1 tensors with weight coefficients w_r and decomposed unit vectors e_{r,n}. Each unit vector is k-dimensional, and the set of vectors e_{r,n} acts as a subspace of the tensor G. The CNN takes a query state vector as input, with a convolution filter composed of the projection (inner product) between |q⟩ and the decomposed vectors, which makes the CNN trainable. The state representation (the actor’s state in Table 1) then takes the product of all mapped unit vectors (from G) over all the sub-words of a query. After these operations, the Actor network yields an action state vector |a_t⟩ (action a_t at time t) depicting a set of matched words.

Critic Network: The Critic network of the qRL framework is based on a quantum-like language model parameterised CNN which takes as input the generated state and the candidate action |a_t⟩ from the Actor network. The output of the Critic network is a scalar value, i.e. the value of the Q-function [10]. The reward values Re ∈ {−1, 0, 1} reflect the quality of the candidate action generated by the Actor network. The reward value represents the probability of assigning the correct label to an action, i.e. the multi-class classification of queries to be matched against documents is used to update the reward. Rewards (or classification labels) are categorised as -1 for a mismatched query with negative word polarity (leading to a compound meaning). For instance, “dogs chase cats” and “dogs do not chase cats” each contribute a compound meaning, but in opposite senses. We consider that a single word can determine the entire polarity of a query, depending on which new word it is associated with. A realistic example of this hypothesis relates to one of our framework’s main constructs, namely |q⟩, which is a state vector equal to the tensor product of possible words, where the word coefficients (i.e. probability amplitudes) of the basis vectors can be altered to derive a new query, giving rise to a compound meaning; the negative word polarity example is an actual instance of this. Positive and zero rewards are classed as matching and partially matching queries, respectively.

In the Critic part, the concatenation of the actor’s state and the candidate action is performed using one-hot encoding, in which the query is passed through a complex-valued lookup table where each word, in its own superposition state, is encoded into a complex embedding vector [32]. Then, a measurement is performed using the squared projection to compute the query density matrix from the complex embedding vectors. The probability of a measurement outcome can be estimated using Born’s postulate for a given query state ρ (a density matrix) as p = Tr(Pρ), where p, P and Tr denote the class of the query, the projection matrix, and the trace of a matrix, respectively. The density state ρ = Σ_{i=1}^{n} β_i |w_i⟩⟨w_i| of a query is perceived as the word states in combination, where each density matrix |w_i⟩⟨w_i| reflects a word w_i in a superposition state (and Σ_{i=1}^{n} β_i = 1). The generated query density matrix has real-valued diagonal entries and complex nonzero off-diagonal entries, and both types of entry inherently carry information about the distribution of semantic and contingent meanings. We adopt the interpretation of the complex phase introduced in [32] to compute the sentence density matrix, in which word senses are positive, neutral, or negative. The reward is estimated from the measurement matrix using this interpretation. A pictorial representation of the Critic network is shown in Fig. 1.
In brief, the Actor-critic policy network suffices with respect to the number of components of traditional reinforcement learning. The Agent part of the framework also acts as a controller for the user, in the same way that Information Foraging mechanisms do for a searcher. IFT helps a searcher by suggesting an optimal foraging path via information scent; here, in our framework, the Critic network informs/updates the Actor with a value (reward) for an action that is positively rewarded. Hence, our framework matches foraging in certain regards (such as information seeking behaviour being assessed as foraging and, inherently, as an RL task).

Rewards: The forager’s aim is to identify the relevant (or perfect) match of a query (or patch) for the clicked/selected document, which can be perceived as its reward. Our framework’s reward function, however, is designed to guide the forager in how to perceive the document information and draw the most relevant match (patch). The reward value is discrete, taking the values -1, 0, and +1. The definition of reward in reinforcement learning [10]⁵ bears a certain analogy to information scent, which is a measure of utility and results in two types of information scent score: a scalar value and a probability distribution of scent patterns [18]. In RL, the value distribution of the reward received by an agent can reflect the analogous nature of information scent patterns. Hence, an explainable approach to reinforcement learning-based rewards using the IFT-based model of information scent can give further intuition about negative rewards. Information scent can be interpreted as the perceived relevance of rewarded actions, defined through positive and negative scent values. The physical meaning of positive and negative information scent scores is that the forager accumulates rich information along the path they forage to locate the relevant information, whereas the unhealthy consumption of information turns the searcher negative towards the search environment, which leads them to give up on the information world (or RL environment) or on the task itself.

⁵ Rewards can be normalised to generate outcomes in reinforcement learning [10].

Update Probability Amplitude/Policy: To update the probability amplitude in the Actor network, the important step is to measure the actions for certain states; upon collapse, this yields the occurrence probability, given by the norm of the state vector, for the particular candidate action, which is later executed by the Actor network. The more experience and learning we record for each action (even erroneous actions), the more informative the probability amplitudes become. Since the action |a_t⟩ is the tensor product of all possible words, a single user action (i.e. |a⟩) can be computed from it as the interaction changes the probability amplitudes for the combined meaning.
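How such amplitude/policy updates might look in a heavily simplified actor-critic skeleton is sketched below; this is a stand-in rather than the paper's implementation: the actor is a plain softmax over candidate actions in place of probability amplitudes, the critic is a scalar baseline, and the reward labelling and hyper-parameters are invented.

```python
import numpy as np

rng = np.random.default_rng(3)
n_actions = 3                         # candidate queries for the current document (invented)
theta = np.zeros(n_actions)           # actor parameters (stand-in for probability amplitudes)
value = 0.0                           # critic's scalar value estimate for the current state
lr_actor, lr_critic = 0.1, 0.1

def reward_for(action):
    # Placeholder for the Critic-side labelling: {1, 0, -1} = match / partial / mismatch.
    return (1, 0, -1)[action]

for episode in range(500):
    probs = np.exp(theta) / np.exp(theta).sum()        # softmax policy over candidate actions
    action = rng.choice(n_actions, p=probs)
    r = reward_for(action)
    advantage = r - value                              # critic-provided baseline
    # Policy-gradient step: raise the log-probability of advantageous actions.
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    theta += lr_actor * advantage * grad_log_pi
    value += lr_critic * (r - value)                   # critic update towards observed reward

print("learned action probabilities:", np.round(np.exp(theta) / np.exp(theta).sum(), 2))
```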
5 Conclusion and Future Work

In this paper, we propose a mathematical framework for reinforcement learning inspired by the Hilbert space formalism of quantum theory. The framework models the learning process behind a forager’s actions in a semantic query matching task, given that the search environment is patchy. The core of our framework is to characterise a forager about whom very little is known: their search pattern is unclear, their information need and its features are unclear or evolving, and there is no information about how a forager makes their trail choices while finding information (initially the information scent is unknown and only emerges as it is followed via distinct cues) or about the amount of information they consume in real-time interaction with the search system. In addition, the major trade-off between exploration and exploitation in the foraging process makes understanding the forager’s search actions complex. To tackle such a complex process of dynamic actions for a state, and vice versa, we adapt the Actor-critic reinforcement learning method as a policy network, in which the actor network is continuously informed by the critic network about the value of the generated action. The framework draws on quantum probability constructs to model the representation of the forager’s search actions and states. Quantum theory has previously been applied in the area of information seeking [6], but representing and measuring the actions of each state is a challenging scenario due to the continuous parallel update of state and action, so using the Actor-critic reinforcement learning method paves the way to influence the learning and representation mechanisms; many complex IR problems could be interpreted appropriately in a new way within such an inclusive framework. In the future, we intend to evaluate this framework on specific IR tasks.

Acknowledgements

This work is part of the Quantum Access and Retrieval Theory (QUARTZ) project, which has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 721321.

References

1. Pirolli, P., & Card, S. (1995, May). Information foraging in information access environments. In Proceedings of the SIGCHI conference on Human factors in computing systems (pp. 51-58).
2. Chowdhury, S., Gibb, F., & Landoni, M. (2011). Uncertainty in information seeking and retrieval: A study in an academic environment. Information Processing & Management, 47(2), 157-175.
3. Tran, V. T., & Fuhr, N. (2012, August). Using eye-tracking with dynamic areas of interest for analyzing interactive information retrieval. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval (pp. 1165-1166).
4. Pirolli, P., & Card, S. (1999). Information foraging. Psychological Review, 106(4), 643.
5. Brennan, K., Kelly, D., & Arguello, J. (2014, August). The effect of cognitive abilities on information search for tasks of varying levels of complexity. In Proceedings of the 5th Information Interaction in Context Symposium (pp. 165-174). ACM.
6. Wittek, P., Liu, Y. H., Darányi, S., Gedeon, T., & Lim, I. S. (2016). Risk and ambiguity in information seeking: Eye gaze patterns reveal contextual behavior in dealing with uncertainty. Frontiers in Psychology, 7, 1790.
7. Tang, Z., & Yang, G. H. (2019). Dynamic Search – Optimizing the Game of Information Seeking. arXiv preprint arXiv:1909.12425.
8. Piwowarski, B., Frommholz, I., Lalmas, M., & Van Rijsbergen, K. (2010, October). What can quantum theory bring to information retrieval. In Proceedings of the 19th ACM international conference on Information and knowledge management (pp. 59-68).
9. Charnov, E. L. (1976). Optimal foraging, the marginal value theorem.
10. Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.
11. Chandak, Y., Theocharous, G., Kostas, J., Jordan, S., & Thomas, P. (2019, May). Learning Action Representations for Reinforcement Learning. In International Conference on Machine Learning (pp. 941-950).
12. White, R. W. (2014). Belief dynamics in Web search. Journal of the Association for Information Science and Technology, 65(11), 2165-2178.
13. Chi, E. H., Pirolli, P., Chen, K., & Pitkow, J. (2001, March). Using information scent to model user information needs and actions and the Web. In Proceedings of the SIGCHI conference on Human factors in computing systems (pp. 490-497). ACM.
14. Chi, E. H., Pirolli, P., & Pitkow, J. (2000, April). The scent of a site: A system for analyzing and predicting information scent, usage, and usability of a web site. In Proceedings of the SIGCHI conference on Human Factors in Computing Systems (pp. 161-168). ACM.
15. Liu, H., Mulholland, P., Song, D., Uren, V., & Rüger, S. (2010, August). Applying information foraging theory to understand user interaction with content-based image retrieval. In Proceedings of the third symposium on Information interaction in context (pp. 135-144). ACM.
16. Jaiswal, A. K., Liu, H., & Frommholz, I. (2019). Effects of Foraging in Personalized Content-based Image Recommendation. arXiv preprint arXiv:1907.00483.
17. Jaiswal, A. K., Liu, H., & Frommholz, I. (2019, December). Information Foraging for Enhancing Implicit Feedback in Content-based Image Recommendation. In Proceedings of the 11th Forum for Information Retrieval Evaluation (pp. 65-69).
18. Jaiswal, A. K., Liu, H., & Frommholz, I. (2020, April). Utilising information foraging theory for user interaction with image query auto-completion. In European Conference on Information Retrieval (pp. 666-680). Springer, Cham.
19. Nogueira, R., Bulian, J., & Ciaramita, M. (2018). Learning to coordinate multiple reinforcement learning agents for diverse query reformulation. arXiv preprint arXiv:1809.10658.
20. Du, J. T., & Spink, A. (2011). Toward a web search model: Integrating multitasking, cognitive coordination, and cognitive shifts. Journal of the American Society for Information Science and Technology, 62(8), 1446-1472.
21. White, R. W., Bennett, P. N., & Dumais, S. T. (2010, October). Predicting short-term interests using activity-based search context. In Proceedings of the 19th ACM international conference on Information and knowledge management (pp. 1009-1018). ACM.
22. Lowe, R., Wu, Y. I., Tamar, A., Harb, J., Abbeel, O. P., & Mordatch, I. (2017). Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in neural information processing systems (pp. 6379-6390).
23. Zhang, B. T., & Seo, Y. W. (2001). Personalized web-document filtering using reinforcement learning. Applied Artificial Intelligence, 15(7), 665-685.
24. Seo, Y. W., & Zhang, B. T. (2000, January). A reinforcement learning agent for personalized information filtering. In Proceedings of the 5th international conference on Intelligent user interfaces (pp. 248-251). ACM.
25. Eliassen, S., Jørgensen, C., Mangel, M., & Giske, J. (2007). Exploration or exploitation: life expectancy changes the value of learning in foraging strategies. Oikos, 116(3), 513-523.
26. Yue, Y., & Joachims, T. (2009, June). Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th Annual International Conference on Machine Learning (pp. 1201-1208). ACM.
27. Balabanović, M. (1998). Exploring versus exploiting when learning user models for text recommendation. User Modeling and User-Adapted Interaction, 8(1-2), 71-102.
28. Van Rijsbergen, C. J. (2004). The geometry of information retrieval. Cambridge University Press.
29. Von Neumann, J. (2018). Mathematical Foundations of Quantum Mechanics: New Edition. Princeton University Press.
30. Fakhari, P., Rajagopal, K., Balakrishnan, S. N., & Busemeyer, J. R. (2013). Quantum inspired reinforcement learning in changing environment. New Mathematics and Natural Computation, 9(03), 273-294.
31. Zhou, J., & Agichtein, E. (2020, April). RLIRank: Learning to Rank with Reinforcement Learning for Dynamic Search. In Proceedings of The Web Conference 2020 (pp. 2842-2848).
32. Li, Q., Uprety, S., Wang, B., & Song, D. (2018). Quantum-inspired complex word embedding. arXiv preprint arXiv:1805.11351.
33. Wang, B., Li, Q., Melucci, M., & Song, D. (2019, May). Semantic Hilbert space for text representation learning. In The World Wide Web Conference (pp. 3293-3299).
34. Zhang, L., Zhang, P., Ma, X., Gu, S., Su, Z., & Song, D. (2019, July). A generalized language model in tensor space. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33, pp. 7450-7458).
35. Cohen, N., Sharir, O., & Shashua, A. (2016, June). On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory (pp. 698-728).
36. Wang, B., Zhao, D., Lioma, C., Li, Q., Zhang, P., & Simonsen, J. G. (2019, September). Encoding word order in complex embeddings. In International Conference on Learning Representations.