
Understanding how to train Neural Networks for Control Theory Q-learning and SARSA



Hi everyone. I've been learning about reinforcement learning (RL) for a little while, with the goal of creating an agent that I could use in a game, e.g. driving a car around a track. I want to learn how to combine a neural network architecture with an RL method such as Q-learning or SARSA.

Normally, in error back-propagation neural networks, you have both an input and a given target, e.g. for the XOR pattern the input is 0 0, 0 1, 1 0, or 1 1 and the target is either 0 or 1. Because the target is given, it is easy for me to see where to plug the values into my error back-prop function. The problem for me now is this: given only the state variables in my test problems (Mountain Car or the pendulum), how do I go about using error back-propagation?

Since I first want to build an agent that solves Mountain Car as a test, is this the right set of steps?

S = [-0.5; 0] as the initial state (the input to my neural network)

  1. Create the network (2 inputs, X hidden units, 3 outputs): the 2 inputs are position and velocity, and there is either 1 output or 3 outputs corresponding to the actions. The hidden activation function is a sigmoid (tanh) and the output is purelin (linear).
  2. Run the state values for position and velocity through the network (feed forward) and get 3 Q-values as output. There are 3 outputs because that is how many actions I have.
  3. Select an action A using e-greedy: either a random action, or the action with the best Q-value from this state.
  4. Execute action A in the problem and receive the new state S' and the reward.
  5. Run S' through the neural network and obtain the Q-values for S'.
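
To make steps 1 to 3 concrete, here is a minimal sketch in plain Python. It is only an illustration under assumptions of my own: the hidden layer size (8), the weight range, and the names `init_network`, `forward`, and `epsilon_greedy` are all hypothetical, and the environment step (step 4) is left out.

```python
import math
import random

def init_network(n_in=2, n_hidden=8, n_out=3, seed=0):
    # Assumed sizes: 2 inputs (position, velocity), 8 hidden units, 3 actions.
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
    b2 = [0.0] * n_out
    return w1, b1, w2, b2

def forward(net, state):
    # tanh hidden layer, linear ("purelin") output layer: one Q-value per action.
    w1, b1, w2, b2 = net
    hidden = [math.tanh(sum(w * x for w, x in zip(row, state)) + b)
              for row, b in zip(w1, b1)]
    q_values = [sum(w * h for w, h in zip(row, hidden)) + b
                for row, b in zip(w2, b2)]
    return hidden, q_values

def epsilon_greedy(q_values, epsilon, rng=random):
    # With probability epsilon pick a random action, otherwise the greedy one.
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

net = init_network()
hidden, q = forward(net, [-0.5, 0.0])   # S = [-0.5; 0]
action = epsilon_greedy(q, epsilon=0.1)
```

The same `forward` call would be reused on S' in step 5 to get the Q-values of the next state.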

Now I guess I need to compute a target value. Given the Q-learning update Q(s,a) <- Q(s,a) + alpha*[reward + gamma*max_a' Q(s',a') - Q(s,a)],
I think my target output is calculated using QTarget = reward + gamma*max_a' Q(s',a'), right?
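
As a small sketch of that target calculation, assuming the Q-values for S' are already available as a list: the function names and the `done` flag are my own additions, with `done` reflecting the usual convention that a terminal state has no future value. The SARSA variant is shown only for comparison; it uses the Q-value of the action actually chosen in S' instead of the maximum.

```python
def q_learning_target(reward, gamma, q_next, done=False):
    # Q-learning: bootstrap from the best action value in the next state.
    if done:
        return reward
    return reward + gamma * max(q_next)

def sarsa_target(reward, gamma, q_next, next_action, done=False):
    # SARSA: bootstrap from the action actually selected in the next state.
    if done:
        return reward
    return reward + gamma * q_next[next_action]

# Hypothetical numbers: reward of -1 per step, gamma = 0.9.
q_next = [0.0, 2.0, 1.0]
target_q = q_learning_target(-1.0, 0.9, q_next)
target_s = sarsa_target(-1.0, 0.9, q_next, next_action=2)
```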

So that means I now choose the max Q-value from step 5 and plug it into the QTarget equation.

Do I calculate an Error again like in the original backprop algo?

So Error=QTarget-Q(S,A) ?

And then resume the normal neural network backprop weight updates?
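
One common way to do this, sketched below under my own assumptions, is to treat Error = QTarget - Q(S,A) as the error of the output unit for the chosen action only and give the other output units zero error, then back-propagate as usual. The network size, learning rate, dictionary layout, and the names `make_net`, `forward`, and `q_update` are hypothetical, and the target is held fixed here purely to show the error shrinking; in real training it would be recomputed on every step.

```python
import math
import random

def make_net(n_in=2, n_hidden=4, n_out=3, seed=0):
    rng = random.Random(seed)
    def rand(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {"w1": rand(n_hidden, n_in), "b1": [0.0] * n_hidden,
            "w2": rand(n_out, n_hidden), "b2": [0.0] * n_out}

def forward(net, state):
    hidden = [math.tanh(sum(w * x for w, x in zip(row, state)) + b)
              for row, b in zip(net["w1"], net["b1"])]
    q = [sum(w * h for w, h in zip(row, hidden)) + b
         for row, b in zip(net["w2"], net["b2"])]
    return hidden, q

def q_update(net, state, action, target, lr=0.1):
    hidden, q = forward(net, state)
    error = target - q[action]            # Error = QTarget - Q(S,A)
    # Back-propagate through the chosen output only; tanh'(z) = 1 - tanh(z)^2.
    hidden_delta = [error * net["w2"][action][j] * (1.0 - h * h)
                    for j, h in enumerate(hidden)]
    # Linear output layer: only the chosen action's weights and bias change.
    for j, h in enumerate(hidden):
        net["w2"][action][j] += lr * error * h
    net["b2"][action] += lr * error
    # Hidden layer weights and biases.
    for j, d in enumerate(hidden_delta):
        for i, x in enumerate(state):
            net["w1"][j][i] += lr * d * x
        net["b1"][j] += lr * d
    return error

net = make_net()
untouched_row = list(net["w2"][0])
# Repeat the update on one (state, action, target) triple for illustration.
errors = [q_update(net, [-0.5, 0.0], action=2, target=1.0) for _ in range(50)]
```

Note that the hidden layer is shared, so the Q-values of the other actions can still shift a little even though their output weights are not touched.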

Thanks, Sevren


Edited by Sevren
