III.4.4 Proximal Policy Optimisation

The Proximal Policy Optimisation (PPO) algorithm (Schulman, Wolski, Dhariwal, et al., 2017) optimises an agent's policy to maximise expected returns while maintaining training stability and computational efficiency. PPO achieves this stability through an objective function that penalises large deviations from the previous policy, keeping the new and old policies closely aligned and reducing the risk of performance collapse. The introduction of a "clipped" surrogate objective function is a key innovation of PPO; it enhances stability and has made the algorithm widely adopted across DRL applications.

Clipped surrogate objective function $O^{\mathrm{CLIP}}(\theta)$: Let $T$ denote a time horizon of fixed length, and let $t \in \{0, 1, \dots, T-1\}$ denote a decision point. The clipped surrogate objective function in PPO is given by

$$O^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\big(\rho_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(\rho_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right], \qquad (6.11)$$

where $\theta$ denotes the policy parameters. The term $\rho_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ is the probability ratio of action $a_t \in A$ in state $s_t \in S$ under the new policy $\pi_\theta$ relative to the old policy $\pi_{\theta_{\mathrm{old}}}$. $\hat{A}_t$ is an estimator of the advantage function, indicating the relative benefit of an action. The clipping function $\mathrm{clip}(\cdot)$ restricts $\rho_t(\theta)$ to the interval $[1-\epsilon,\,1+\epsilon]$, where $\epsilon$ typically lies between 0.1 and 0.2. The expectation $\hat{\mathbb{E}}_t[\cdot]$ denotes an empirical average over a sample batch.

Advantage function $\hat{A}_t$: The advantage function $\hat{A}_t$ is usually estimated using generalised advantage estimation, calculated as

$$\hat{A}_t = \delta_t + (\gamma\lambda)\,\delta_{t+1} + (\gamma\lambda)^2\,\delta_{t+2} + \cdots + (\gamma\lambda)^{T-t-1}\,\delta_{T-1}, \qquad (6.12)$$

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the temporal-difference error for state $s_t \in S$ and reward $r_t \in R$; $\gamma$ is the discount factor valuing current over future rewards, and $\lambda \in [0,1]$ balances the bias-variance trade-off in $\hat{A}_t$. $V(s_{t+1})$ and $V(s_t)$ are the value functions (see Eq. 6.7, page 146) for the states $s_{t+1}$ and $s_t$, respectively.

Training with PPO: Training involves sampling data through the agent's (A) interaction with the environment (E) while executing the policy $\pi_{\theta_{\mathrm{old}}}$. Subsequently, the advantage function $\hat{A}_t$ (Eq. 6.12) is estimated, and the clipped surrogate objective function $O^{\mathrm{CLIP}}(\theta)$ (Eq. 6.11) is optimised by stochastic gradient descent on the policy parameters $\theta$. This process is repeated with the updated policy until a stopping criterion is met.
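To make the two definitions concrete, the sketch below shows how Eq. 6.12 and Eq. 6.11 could be evaluated for one sampled batch. It is a minimal illustration under stated assumptions, not the implementation used in this work; the NumPy arrays, function names and the default values of $\gamma$, $\lambda$ and $\epsilon$ are assumptions chosen for readability.

```python
# Minimal sketch of Eq. 6.12 (generalised advantage estimation) and
# Eq. 6.11 (clipped surrogate objective). Shapes and defaults are assumed.
import numpy as np


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Estimate A_hat_t = sum_k (gamma*lam)^k * delta_{t+k} (Eq. 6.12).

    rewards: array of length T with r_0, ..., r_{T-1}
    values:  array of length T+1 with V(s_0), ..., V(s_T)
    """
    T = len(rewards)
    # Temporal-difference errors: delta_t = r_t + gamma*V(s_{t+1}) - V(s_t)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros(T)
    running = 0.0
    # Backward recursion A_t = delta_t + gamma*lam*A_{t+1} reproduces the
    # truncated sum of Eq. 6.12.
    for t in reversed(range(T)):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages


def clipped_surrogate(log_prob_new, log_prob_old, advantages, eps=0.2):
    """Empirical average of min(rho_t*A_hat_t, clip(rho_t)*A_hat_t) (Eq. 6.11)."""
    ratio = np.exp(log_prob_new - log_prob_old)             # rho_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))
```

In practice the log-probabilities would come from the current and the frozen old policy networks, and the objective would be maximised by gradient ascent on $\theta$ (equivalently, descent on its negative) using automatic differentiation rather than plain NumPy.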