There is a lot that you can do to parallelize pathfinding even for a single agent.

I would start by considering more basic algorithms than A*, and only incorporating some of A*'s features (prioritized sweeping, a heuristic) as you understand how to do the more basic algorithms in parallel.

**Conceptual Starting Point #1: Breadth-first search**

A* is basically a souped-up breadth-first search. You can do BFS in parallel: Expand all the nodes in a ply in parallel, and synchronize between plys. Can you do that on the GPU?

**Conceptual starting point #2: Value Iteration**

Value Iteration is more general than A* and very easy to parallelize. In the case of the single-source shortest path problem, it reduces to the following. A value d(v) is associated to each vertex v; it represents the distance to the goal. It is initialized,

d(goal ) = 0

d(v) = Infinity for all v != goal

Then, you simply do the following:

d(v) := min_{u in neighbors(v)} [ edge_cost(v,u) + d(u) ]

E.g., if vertex v has three neighbors, u1, u2, and u3, then it looks at the three quantities,

edge_cost(v, u1) + d(u1)

edge_cost(v, u2) + d(u2)

edge_cost(v, u3) + d(u3)

and picks whichever one is smallest.

This minimization can be performed in parallel by all vertices. To do it, a vertex needs only to look at its neighbors' values. This can be done either in a "Jacobi style" (in which new values are written to a separate buffer, and buffers are swapped at the end of the iteration -- this is the classic Value Iteration), or in a "Gauss-Seidel style" (in which new values are written "in-place" back to the original buffer -- sometimes called "Asynchronous dynamic programming"). The point is that you have a great deal of flexibility in the order in which you do these updates, while still guaranteeing convergence. Dijkstra's algorithm and A* are "just" very, very careful optimizations of this basic iteration, first to avoid updating nodes that don't need updates, and then to bias the updates in the direction of the goal (while making certain guarantees).

**Detail: Bidirectional search**

Shortest path problems can be solved more quickly by bidirectional search. At the very least, you can perform these two searches in parallel (though this is more of "cpu-level parallelism" than "gpu-level parallelism").