
Evaluated parameter configuration $\lambda$, and the corresponding reward $r \in R$ observed, to the previous state $s \in S$:

$$ s' = \tau(s, a, r) = \big(s, (\lambda = g(a), r)\big). \quad (4) $$

The agent reaches the terminal state once the prescribed budget $T$ is exceeded. At each step $t \leq T$, the agent observes the data $d \in D$, the state $s_t = (d_t, (\lambda_0, r_0), \ldots, (\lambda_t, r_t))$, and the next state $s_{t+1} = (d_{t+1}, (\lambda_0, r_0), \ldots, (\lambda_t, r_t), (\lambda_{t+1}, r_{t+1}))$. This means that each state $s$ includes all previously evaluated parameter configurations together with their corresponding responses. The budget can be defined as a limit on running time, as reaching a target reward, or as selecting the same parameter set twice in a row. The last condition forces the agent to keep exploring the parameter space rather than getting stuck in a single parameter configuration.

3.2. Artificial Intelligence Agent

The agent interacts with the environment $\mathcal{E}$ with the task of maximizing the expected discounted reward. It executes actions from the action space and receives observations and rewards. At each time step, which ranges over a set of discrete time intervals, the agent selects an action $a$ at state $s$. The behavior of the agent is governed by a stochastic policy, $\pi : S \rightarrow A$, which tells the agent which action should be selected in each possible state. As a result of each action, the agent receives a scalar reward $r$ and observes the next state $s'$. The policy is used to compute the true state-action value $Q^{\pi}(s, a)$ as:

$$ Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_0 = s, a_0 = a \right], \quad (5) $$

where $\gamma \in [0, 1]$ is the discount factor that balances immediate and future rewards; it also keeps the return finite when the task has no terminal state. The aim of the agent is to learn an optimal policy that selects, in each state, the action maximizing the discounted cumulative reward, $\pi^{*}(s) \in \arg\max_{a} Q^{*}(s, a)$, where $Q^{*}(s, a)$ denotes the optimal action value.
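To make the environment dynamics concrete, the following is a minimal Python sketch of the transition in Eq. (4) and the terminal conditions described above. The class name `TuningEnv`, its attributes, and the placeholder `evaluate` routine are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the tuning environment (Eq. 4); names and the placeholder
# evaluate() are assumptions for illustration only.
from dataclasses import dataclass, field

@dataclass
class TuningEnv:
    dataset: object          # d in D: the data the agent conditions on
    configurations: list     # discrete action space: a -> lambda = g(a)
    budget: int              # prescribed budget T
    history: list = field(default_factory=list)   # [(lambda_0, r_0), ...]
    t: int = 0

    def evaluate(self, lam):
        """Placeholder: train/validate a model with configuration lam, return reward."""
        raise NotImplementedError

    def reset(self):
        self.history, self.t = [], 0
        return (self.dataset, tuple(self.history))

    def step(self, action):
        lam = self.configurations[action]                      # lambda = g(a)
        repeated = bool(self.history) and self.history[-1][0] == lam
        r = self.evaluate(lam)                                 # scalar reward r
        self.history.append((lam, r))                          # Eq. (4): s' = (s, (lambda, r))
        self.t += 1
        # Terminal if the budget is exceeded or the same configuration
        # was selected twice in a row.
        done = self.t >= self.budget or repeated
        return (self.dataset, tuple(self.history)), r, done
```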

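The sketch below illustrates Eq. (5) and the greedy relation $\pi^{*}(s) \in \arg\max_{a} Q^{*}(s, a)$ using a standard tabular $\epsilon$-greedy Q-learning agent. This agent, its hyperparameters ($\gamma$, $\alpha$, $\epsilon$), and the assumption of hashable states are illustrative simplifications, not the agent architecture proposed by the authors.

```python
# Illustration of the discounted return in Eq. (5) and a greedy policy over Q;
# the tabular Q-learning update is a textbook form used here as an assumption.
from collections import defaultdict
import random

def discounted_return(rewards, gamma):
    """Monte-Carlo estimate of Eq. (5) for a single trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

class QAgent:
    def __init__(self, n_actions, gamma=0.99, alpha=0.1, epsilon=0.1):
        self.Q = defaultdict(lambda: [0.0] * n_actions)   # Q(s, a) table
        self.gamma, self.alpha, self.epsilon = gamma, alpha, epsilon
        self.n_actions = n_actions

    def act(self, state):
        # epsilon-greedy behaviour; the greedy part is pi*(s) in argmax_a Q(s, a)
        if random.random() < self.epsilon:
            return random.randrange(self.n_actions)
        q = self.Q[state]
        return max(range(self.n_actions), key=q.__getitem__)

    def update(self, s, a, r, s_next, done):
        # One-step Q-learning target: r + gamma * max_a' Q(s', a')
        target = r if done else r + self.gamma * max(self.Q[s_next])
        self.Q[s][a] += self.alpha * (target - self.Q[s][a])
```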