XXX_Andrew_XXX 340 Report post Posted November 7, 2006 I'm not sure if anyone can help (Timkin?) - I'm trying to get my head around reinforcement learning (gah, it's starting to get me down, not often that I've feel so bogged down trying to get into something new). I'm working my way through some of the examples in the Sutton and Barto book (which rocks) and refering to the Bertsekas books as I get my head around the maths. I'm trying to skip a few steps here to head in a direction more relevant to my work. Be warned, I'm just starting to learn about this stuff - if you want to correct my terminology or suggest anything please do. I'm trying to get started on learning optimal control techniques for vehicles in obstacle free environments, with the aim of using them as improved metrics for planning in environments with obstacles. I'm trying to get started with the Dubins car, mainly because I've seen visualisations of optimal trajectories. I'm reasonably sure I understand what I need to do but some suggestions on how to manage and partition the state space would be fantasic. I assume I need some type of tiling system (CMAC) to partition the state space. To get started I converted the standard dubins car equations and have converted them into polar form (as is standard for optimal control work). The inputs to the model are forward velocity (assumed to be one) and turn rate which is restricted to {-1,0,1}. The state space is made up of rho, alpha and beta. Alpha and beta are angles, so that makes partioning them is easy. rho is the distance between the vehicle and the goal, which makes partitioning much harder because it is unbounded. 1. Is what I'm trying to do - learn the optimal action-value function Q(s,a) possible (or reasonable) for this type of problem? I'm pretty sure that it is, I've seen other research which relies on these types of results. 2. Do I need to place some type of artificial bound on rho to be able to partition the state space? what implications does this have on the solutions, particularly over time. 3a. Do my tiles need to be the same size? I'm pretty sure they don't need to be 3b. Can some tiles be unbounded? I assume that this would really muck up the Q values (make them infinity). If this isn't the right place to talk about this I'm more than willing to email or PM anybody with experience in this area 0 Share this post Link to post Share on other sites
Timkin 864 Report post Posted November 7, 2006 Doh, the reference book I want is at home on the dining table...Off the top of my head, you don't have to discretise the state space, you just need to solve a Hamilton-Jacobi-Bellman equation for the value function you choose. Although, if I remember correctly, the Dubins Car involves non-holonomy, so that might make it tougher. I've got a book on viscousity solutions of HJB equations at home which covers this sort of problem. I'll post something on it tonight.If you want to specfically use Q-learning to learn a stationary policy, then yes it would be easier if you discretised the state space, since you then have merely a lookup table for your optimal control. However I'm not convinced that's the best way to solve this problem... unless there's some other reason constraining you to discrete spaces.Two other quick questions: (1) from you post, it sounds like you're trying to do waypoint tracking. Have you instead considered having the vehicle try and follow a specified parametric trajectory (such as a spline curve)? This is a slightly easier problem since you're looking for a policy to minimise the distance-to-track error. Of course, any waypoint course is also a piecewise linear spline! ;)(2) You're from Melbourne... and doing a PhD... where?Cheers,Timkin 0 Share this post Link to post Share on other sites
XXX_Andrew_XXX 340 Report post Posted November 7, 2006 My PhD is through Swinburne, but I'm not on site much.The reason for discretizing the state space is that the research problem doesn't really have anything to do with the dubins car - we're working on much more complex vehicles. Mainly AUV's but with faciltiy for UAV's (including aerosondes). Dubins was going to be the first step, then move onto more relevant models like dubins with steering acceleration before starting work on our vehicles (which are currently gain scheduled linear models). At this point I'm assuming that the problem will be stationary.My thesis is concentrated on obstacle avoidance for AUV's. I am particularly interested in trying to get my head around the work of Frazzoli on Manuever Automatons http://rigoletto.seas.ucla.edu/papers/Frazzoli.PhD01.pdf . One point the made in the thesis is that if you can do optimal control in an obstacle free environment you can use those results as a lower bound metric to improve the performance of probabilistic path planners like RRT's when you do have obstacles. Because of the application I've got in mind, tracking rate commands is much more important than tracking waypoints, but this aspect of my problem is more about having some type of estimate of the optimal cost to go.I've seen lots of reference to the HJB equations, but I'm really unsure of where I should even start looking if I want to setup the equations for the dubins car let alone anything more complex.Any references or suggested reading would be really appreciated. 0 Share this post Link to post Share on other sites
Timkin 864 Report post Posted November 11, 2006 Sorry for the delay in a followup to my earlier post... I'm flat out getting a presentation ready for a one day conference tomorrow... but I'll be back to my semi-normal state on Tuesday and hopefully I'll have a few moments to post some more thoughts.Cheers,Timkin 0 Share this post Link to post Share on other sites
Timkin 864 Report post Posted November 15, 2006 Okay, I've got a few moments... phew...Where are you at with this problem? Same place as last week? A few thoughts...To partition rho, first map it into a bounded space, using say r=tanh(rho), then discretise this space using a uniform partition set. While this gives you a uniform partitioning of your discrete space (alpha,beta,r), it provides a non-uniform discretisation of the original state space. One problem you may face is that the learning will be insensitive to control variations far from the target. The other choice is to make time a state variable and give it a non-uniform disretisation as a function of rho. I probably wouldn't go down that path though.Have you managed to generate a discrete map of the form s_{p} = G(s_{q},a_{m}) (for states s and controls a) from your equations of motion?As for the HJB approach... you basically need to write down your equations of motion for the system in Hamiltonian coordinates (generally not a trivial task). This is much the same as casting a problem in Lagrangian coordinates (and indeed, one way to get the Hamiltonian dynamics is to first write down the Lagrangian and then transform to the Hamiltonian using the known transformations between these two systems). Why is this a good thing to do? Because given the Hamiltonian, the optimal control problem in continuous space is trivially expressed as a dynamic programming problem. Q-learning is basically the discrete space version of the discounted HJB equation. Why do I think this is a good path to follow? Because, when you end up dealing with UAVs and AUVs, you'll be dealing with vehicles in a non-zero background flow. This will royally screw with any discretisation scheme and you'll need to solve your problem in its natural continuous space taking into account the asymmetry of motion induced by this flow.I'd enjoy the opportunity to chat to you about your project as well. Send me an email to the address in my profile and I'll send you my work contact details. In particular I'd like to find out what Swinburne is doing in the area of UAVs and AUVs and see how that matches up with our work at Deakin.Cheers,Timkin 0 Share this post Link to post Share on other sites