As I recall the GPUs still make you do a lockstep instruction sequence of 32 data sets in parallel.
?? Partial patrallel processing for the step you find adjacents to add to the Open List ?? in 2D thats 8 on a regular grid and 26 candidates for a 3D regular grid
You may be able (for a single Agent thread) to employ a HeapQ which will keep your Open list tree best candidates in sequential order in the same memory space. (to be processed in lumps in parallel) the adding back to the HeapQ unfortunately would have to be sequential
A significant part of the A* processing will still be irregular (and lower the pure use of the parallel nature of the GPU alot)
You more definitely could have A* paths being calculated in parallel for Multiple Independant Objects (or multiple targets being considered for the same Agent/object)