Understanding how to train Neural Networks for Control Theory: Q-learning and SARSA


No replies to this topic

#1 Sevren   Members   -  Reputation: 105


Posted 14 August 2013 - 10:27 AM

Hi everyone. I've been learning about reinforcement learning for a little while now, in an attempt to create an agent that I could use in a game, e.g. driving a car around a track. I want to learn how to combine a neural network with an RL algorithm such as Q-learning or SARSA.

Normally, with error back-propagation in a neural network, you have both the input and a given target. For example, for the XOR pattern the input is (0,0), (1,1), (0,1), or (1,0), and the target is either 0 or 1. Since the target is given, it's easy for me to see where to plug the values into my error back-prop function. The problem for me now is that in my test problems (Mountain Car or the pendulum) I'm given only the state variables, so how do I go about using error back-propagation?
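
For reference, here is a minimal numpy sketch of the supervised XOR case I mean, just to show that both the inputs and the targets are given up front; the hidden-layer size, learning rate, and epoch count are arbitrary placeholders:

import numpy as np

# Supervised case: both the inputs and the targets are known in advance.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # XOR inputs
T = np.array([[0], [1], [1], [0]], dtype=float)               # XOR targets

rng = np.random.default_rng(0)
W1 = rng.normal(scale=1.0, size=(2, 4))                       # input -> hidden (4 hidden units, arbitrary)
b1 = np.zeros(4)
W2 = rng.normal(scale=1.0, size=(4, 1))                       # hidden -> output
b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

alpha = 0.5                                                   # learning rate (placeholder)

for epoch in range(20000):
    # forward pass
    H = sigmoid(X @ W1 + b1)
    Y = sigmoid(H @ W2 + b2)

    # backward pass: error = (target - output), then backpropagate it
    d_out = (T - Y) * Y * (1 - Y)
    d_hid = (d_out @ W2.T) * H * (1 - H)

    W2 += alpha * H.T @ d_out
    b2 += alpha * d_out.sum(axis=0)
    W1 += alpha * X.T @ d_hid
    b1 += alpha * d_hid.sum(axis=0)

print(np.round(Y.ravel(), 2))                                 # should approach [0, 1, 1, 0] if training converges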

Since I first want to build an agent that solves Mountain Car as a test, is this the right set of steps? (A rough code sketch of these steps follows the list.)

S = [-0.5, 0] as the initial state (the input into my neural network)

  1. Create the network (2 inputs, X hidden units, 3 outputs): 2 inputs for position and velocity, and either 1 output or 3 outputs corresponding to the actions. The hidden activation function is sigmoid (tanh) and the output activation is linear (purelin).

  2. Run the state values for position and velocity through the network (feed-forward) and get 3 Q-values as output; there are 3 outputs because that is how many actions I have.

  3. Select an action A using epsilon-greedy: either a random action, or the one with the best Q-value for this state.

  4. Execute action A in the environment and receive the new state S' and the reward.

  5. Run S' through the neural network and obtain the Q-values for S'.
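
Here is how I picture steps 1-5 in Python/numpy (just a sketch of my understanding, not a finished implementation). The hidden-layer size and epsilon are arbitrary, and the mountain_car_step helper is my own stand-in for the standard Sutton & Barto Mountain Car dynamics so the sketch is self-contained:

import numpy as np

# Stand-in for the environment: the classic Mountain Car dynamics (Sutton & Barto).
def mountain_car_step(state, action):
    pos, vel = state
    vel += 0.001 * (action - 1) - 0.0025 * np.cos(3 * pos)   # actions {0,1,2} -> force {-1,0,+1}
    vel = np.clip(vel, -0.07, 0.07)
    pos = np.clip(pos + vel, -1.2, 0.6)
    if pos <= -1.2:
        vel = 0.0                                            # hit the left wall
    done = pos >= 0.5                                        # reached the goal
    return np.array([pos, vel]), -1.0, done                  # reward is -1 per step

# Step 1: network (2 inputs, n_hidden units, 3 outputs), tanh hidden layer, linear (purelin) output.
n_hidden = 20                                                # number of hidden units (placeholder)
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(2, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, 3))
b2 = np.zeros(3)

def forward(s):
    h = np.tanh(s @ W1 + b1)                                 # hidden activations
    q = h @ W2 + b2                                          # one Q-value per action
    return h, q

# Steps 2-5 for a single transition.
epsilon = 0.1
S = np.array([-0.5, 0.0])                                    # initial state

_, Q = forward(S)                                            # step 2: feed-forward, 3 Q-values
if rng.random() < epsilon:                                   # step 3: epsilon-greedy action selection
    A = int(rng.integers(3))
else:
    A = int(np.argmax(Q))

S_next, reward, done = mountain_car_step(S, A)               # step 4: execute A, observe S' and reward
_, Q_next = forward(S_next)                                  # step 5: feed-forward on S'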

Now I guess I need to compute a target value. Given the Q-learning update Q(s,a) = Q(s,a) + alpha * [reward + gamma * max_a' Q(s',a') - Q(s,a)], I think my target output is calculated using QTarget = reward + gamma * max_a' Q(s',a'), right?

So that means I now choose the max Q-value from step 5 and plug it into the QTarget equation.

Do I then calculate an error like in the original backprop algorithm?

So Error = QTarget - Q(S,A)?

And then resume the normal neural network backprop weight updates?
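
Continuing the sketch above, this is how I picture that target/error/update step. The alpha and gamma values are placeholders, and I'm assuming the error should be backpropagated only through the output unit for the action A that was actually taken, leaving the other two outputs untouched:

# Continue from the sketch after the list of steps (same network variables).
alpha = 0.01                                     # learning rate (placeholder)
gamma = 0.99                                     # discount factor (placeholder)

h, Q = forward(S)                                # hidden activations and Q(S, .) again
Q_target = reward + gamma * np.max(Q_next)       # QTarget = reward + gamma * max_a' Q(S',a')
if done:
    Q_target = reward                            # don't bootstrap from a terminal state

delta = np.zeros(3)
delta[A] = Q_target - Q[A]                       # Error = QTarget - Q(S,A), only for the chosen action

# Normal backprop through the linear output and tanh hidden layer.
grad_W2 = np.outer(h, delta)
grad_b2 = delta
d_hidden = (delta @ W2.T) * (1 - h ** 2)         # tanh derivative
grad_W1 = np.outer(S, d_hidden)
grad_b1 = d_hidden

W2 += alpha * grad_W2
b2 += alpha * grad_b2
W1 += alpha * grad_W1
b1 += alpha * grad_b1

S = S_next                                       # then repeat steps 2-5 for the next time step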

Thanks, Sevren


 


Edited by Sevren, 14 August 2013 - 10:29 AM.

