
Member Since 14 Aug 2013
Offline Last Active Aug 14 2013 10:27 AM

#5085856 Understanding how to train Neural Networks for Control Theory Q-learning and...

Posted by Sevren on 14 August 2013 - 10:27 AM

Hi everyone. I've been learning about reinforcement learning for the past little while, in an attempt to learn how to create an agent that I could use in a game, e.g. driving a car around a track. I want to learn how to combine a neural network architecture with RL methods such as Q-learning or SARSA.

Normally in error backpropagation neural networks you have both an input and a given target,
e.g. for the XOR pattern the input is 0 0, 0 1, 1 0 or 1 1 and the target is either 0 or 1. This is given, so it is easy for me to see where to plug the values into my error backprop function. The problem for me now is: given only the state variables in my test problem (Mountain Car or pendulum), how do I go about using error backpropagation?

Since I first want to build an agent that solves Mountain Car as a test, is this the right set of steps?

S = [-0.5; 0] as the initial state (input into my neural network)

  1. Create a network (2 inputs, X hidden units, 3 outputs): the 2 inputs are position and velocity, and there is either 1 output or 3 outputs corresponding to the actions. The hidden activation function is sigmoid (tanh) and the output activation is linear (purelin).
  2. Run the state values for position and velocity through the network (feed forward) and get 3 Q-values as output; there are 3 outputs because that is how many actions I have.
  3. Select an action A using e-greedy: either a random action or the one with the best Q-value for this state.
  4. Execute action A in the problem and receive the new state S' and the reward.
  5. Run S' through the neural network and obtain the Q-values for S'.
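As a rough sketch of steps 1 to 5 in plain Python, assuming 8 hidden units, epsilon = 0.1, and the classic Mountain Car dynamics from Sutton and Barto (reward of -1 per step, goal at position 0.5). The names `make_network`, `forward`, `epsilon_greedy` and `mountain_car_step` are made up for illustration and are not from any library:

```python
import math
import random

def make_network(n_in, n_hidden, n_out, rng):
    """Small random weights; each row is one unit's input weights plus a trailing bias."""
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    return w1, w2

def forward(net, state):
    """Feed forward: tanh hidden layer, linear output layer (one Q-value per action)."""
    w1, w2 = net
    x = list(state) + [1.0]                      # append bias input
    hidden = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in w1]
    h = hidden + [1.0]                           # append bias for the output layer
    q = [sum(w * v for w, v in zip(row, h)) for row in w2]
    return hidden, q

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the best Q-value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def mountain_car_step(state, action):
    """Assumed dynamics: action 0 = reverse, 1 = coast, 2 = forward."""
    pos, vel = state
    vel += 0.001 * (action - 1) - 0.0025 * math.cos(3 * pos)
    vel = max(-0.07, min(0.07, vel))
    pos += vel
    if pos < -1.2:                               # hit the left wall
        pos, vel = -1.2, 0.0
    done = pos >= 0.5
    return (pos, vel), -1.0, done

rng = random.Random(0)
net = make_network(2, 8, 3, rng)                 # step 1
s = (-0.5, 0.0)
_, q_s = forward(net, s)                         # step 2: 3 Q-values for S
a = epsilon_greedy(q_s, 0.1, rng)                # step 3
s_next, reward, done = mountain_car_step(s, a)   # step 4
_, q_next = forward(net, s_next)                 # step 5: Q-values for S'
```

One output per action means a single forward pass gives all the Q-values needed for both the e-greedy choice and the max over actions; a single-output network would need the action as an extra input and one pass per action.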

Now I guess I need to compute a target value. The Q-learning update is Q(s,a) = Q(s,a) + alpha*[reward + gamma*max_a' Q(s',a') - Q(s,a)],
so I think my target output is calculated as QTarget = reward + gamma*max_a' Q(s',a'), right?

So that means I now choose the max Q-value from step 5 and plug it into the QTarget equation.

Do I then calculate an error, like in the original backprop algorithm?

So Error = QTarget - Q(S,A)?

And then resume the normal neural network backprop weight updates?
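Here is a minimal sketch of that target, error and weight update in plain Python. It rests on several assumptions: the target is treated as a constant during backprop, only the output for the action actually taken gets an error (the other outputs get zero), and the target is just the reward when S' is terminal. The names `init_net`, `forward` and `q_learning_update`, the 8 hidden units, and the gamma and learning-rate values are illustrative choices:

```python
import math
import random

def init_net(n_in, n_hidden, n_out, rng):
    """Each row is one unit's input weights plus a trailing bias weight."""
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    return w1, w2

def forward(net, state):
    """tanh hidden layer, linear outputs (one Q-value per action)."""
    w1, w2 = net
    x = list(state) + [1.0]
    hidden = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in w1]
    h = hidden + [1.0]
    q = [sum(w * v for w, v in zip(row, h)) for row in w2]
    return hidden, q

def q_learning_update(net, s, a, reward, s_next, done, gamma=0.99, lr=0.01):
    """One backprop step towards QTarget = reward + gamma * max Q(s', a')."""
    w1, w2 = net
    hidden, q = forward(net, s)
    _, q_next = forward(net, s_next)
    # No bootstrapping from a terminal state (assumption stated above).
    target = reward if done else reward + gamma * max(q_next)
    error = target - q[a]                        # Error = QTarget - Q(S, A)
    x = list(s) + [1.0]
    h = hidden + [1.0]
    # Backpropagate the error through output a only, using the old w2.
    hidden_delta = [error * w2[a][j] * (1.0 - hidden[j] ** 2)
                    for j in range(len(hidden))]
    for j in range(len(h)):                      # output-layer weights of action a
        w2[a][j] += lr * error * h[j]
    for j, delta in enumerate(hidden_delta):     # hidden-layer weights
        for i in range(len(x)):
            w1[j][i] += lr * delta * x[i]
    return error

rng = random.Random(0)
net = init_net(2, 8, 3, rng)
err = q_learning_update(net, (-0.5, 0.0), 2, -1.0, (-0.499, 0.001), done=False)
```

The learning rate of the network plays the role of alpha here, so alpha does not appear separately in the target.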

Thanks, Sevren