- $\sigma(\cdot) : \mathbb{R} \to \mathbb{R}$ is an activation function, which may be non-linear.
- For each $\hat{l} = 1, \dots, \hat{L}$:
  - $W_{\hat{l}} \in \mathbb{R}^{d_{\hat{l}} \times d_{\hat{l}-1}}$ is the weight matrix of the $\hat{l}$-th layer, mapping inputs from dimension $d_{\hat{l}-1}$ to $d_{\hat{l}}$.
  - $b_{\hat{l}} \in \mathbb{R}^{d_{\hat{l}}}$ is the bias vector of the $\hat{l}$-th layer.
  - The function $f_{\hat{l}} : \mathbb{R}^{d_{\hat{l}-1}} \to \mathbb{R}^{d_{\hat{l}}}$, defined by $f_{\hat{l}}(x) = \sigma(W_{\hat{l}} x + b_{\hat{l}})$, describes the operation of the $\hat{l}$-th layer, where $x \in \mathbb{R}^{d_{\hat{l}-1}}$ is the input to the layer.
- The function $f : \mathbb{R}^n \to \mathbb{R}^m$, defined as $f = f_{\hat{L}} \circ f_{\hat{L}-1} \circ \cdots \circ f_1$, represents the overall network operation from input to output.
- $\mathcal{L}(\cdot) : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}$ is the loss function that quantifies the error between the network's output $\hat{y} = f(x)$ and the target output $y$; it is central to training, since the weights and biases are adjusted to minimise it.

From Definitions 13 and 14, DRL involves modelling an agent $\mathcal{A}$ with a DNN-based function $f : \mathbb{R}^{d_s} \to \mathbb{R}^{d_a}$, where $d_s = \dim(S)$ and $d_a = \dim(A)$ denote the dimensions of the state and action spaces, and $f$ aims to approximate the optimal policy $\pi^{*}$ during training. X. Wang, S. Wang, Liang, et al. (2024) discuss the various families of methods employed to address DRL problems, including value-based, policy-based, and maximum-entropy-based methods. In this dissertation, we use Proximal Policy Optimisation, a method belonging to the policy-based family, which is examined in greater detail in Section III.4.4.

III.4.3 Contextual Markov Decision Process

A Contextual Markov Decision Process (CMDP) extends the MDP (see Definition 12, page 145) by incorporating context. The goal in a CMDP is to learn a policy that optimises cumulative reward while accounting for varying hidden static parameters, known as the context (Hallak, Di Castro, and Mannor, 2015). Below, we provide a formal definition.

Definition 15 (Contextual Markov Decision Process). A Contextual Markov Decision Process (CMDP) is formally described by the tuple $\mathcal{M}_c = \langle C, S, A, K(c) \rangle$, where:
- $C$ is the context space, representing the set of all possible static parameters that influence the decision process.
- $S$ and $A$ denote the state and action spaces, respectively.
- $K(c)$ is the function that maps any context $c \in C$ to its corresponding MDP (see Definition 12, page 145). Each mapped MDP is characterised by the tuple $K(c) = \langle S, A, P^{(c)}, R^{(c)}, \pi_0^{(c)}, \gamma \rangle$, where:
  - $\pi_0^{(c)}$ is the initial state probability distribution influenced by $c$.
  - $P^{(c)}(s_{t+1} \mid s_t, a_t)$ is the transition probability function influenced by $c$.
  - $R^{(c)}(s_t, a_t, s_{t+1})$ is the reward function influenced by $c$.
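To make Definition 15 concrete, the following is a minimal Python sketch of a tabular CMDP. It is illustrative only: the class names (MDP, CMDP), the two-dimensional context, and the toy dynamics in make_example_cmdp are assumptions introduced here, not constructs from this dissertation.

```python
# Minimal illustrative sketch of Definition 15: a CMDP as a mapping K from a
# static context c to an ordinary tabular MDP <S, A, P^(c), R^(c), pi0^(c), gamma>.
# All names and the toy dynamics below are hypothetical, chosen only for clarity.
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class MDP:
    P: np.ndarray      # transition probabilities P^(c), shape (|S|, |A|, |S|)
    R: np.ndarray      # rewards R^(c), shape (|S|, |A|, |S|)
    pi0: np.ndarray    # initial state distribution pi0^(c), shape (|S|,)
    gamma: float       # discount factor

@dataclass
class CMDP:
    n_states: int                      # shared state space S
    n_actions: int                     # shared action space A
    K: Callable[[np.ndarray], MDP]     # maps a context c in C to its MDP

def make_example_cmdp(n_states: int = 3, n_actions: int = 2) -> CMDP:
    """Toy CMDP: c[0] controls transition noise, c[1] scales the reward."""
    def K(c: np.ndarray) -> MDP:
        noise, reward_scale = float(c[0]), float(c[1])
        # Intended (deterministic) transitions: action a shifts the state by +a.
        P = np.zeros((n_states, n_actions, n_states))
        for s in range(n_states):
            for a in range(n_actions):
                P[s, a, (s + a) % n_states] = 1.0
        # Context-dependent slippage: mix with the uniform kernel.
        P = (1.0 - noise) * P + noise / n_states
        # Context-dependent reward: transitions into state 0 pay reward_scale.
        R = np.zeros((n_states, n_actions, n_states))
        R[:, :, 0] = reward_scale
        pi0 = np.full(n_states, 1.0 / n_states)  # uniform initial distribution
        return MDP(P=P, R=R, pi0=pi0, gamma=0.99)
    return CMDP(n_states=n_states, n_actions=n_actions, K=K)

# The context is drawn once and stays fixed (hidden from the agent), after which
# the agent interacts with the ordinary MDP K(c) that it induces.
cmdp = make_example_cmdp()
c = np.array([0.2, 1.0])                              # one static context from C
mdp = cmdp.K(c)
rng = np.random.default_rng(0)
s = rng.choice(cmdp.n_states, p=mdp.pi0)              # s_0 ~ pi0^(c)
a = int(rng.integers(cmdp.n_actions))                 # some action a_0
s_next = rng.choice(cmdp.n_states, p=mdp.P[s, a])     # s_1 ~ P^(c)(. | s_0, a_0)
r = mdp.R[s, a, s_next]                               # R^(c)(s_0, a_0, s_1)
```

In this sketch each call to K(c) yields an ordinary MDP, mirroring the definition above: the context is static within an episode, and the difficulty lies in learning a single policy that performs well across the different MDPs induced by the contexts in $C$.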