Yes, although there is a "but...".
The GPU itself generally makes no promises and gives no guarantees whatsoever, and indeed works very differently from what one would "intuitively" expect.
However, the graphics or compute API that you use (such as OpenGL, CUDA, or Direct3D) will usually give certain guarantees, and most of the time it does not matter in which order operations happen anyway.
Except, of course, when it does matter... that's when you need things like barrier (OpenCL) or memoryBarrier (GLSL), or glFenceSync at a higher level, or functions like glTextureBarrier.
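As a sketch of what such a shader-side barrier looks like in practice, here is a GLSL compute shader that reduces values in shared memory (the buffer name `Data` and the work group size are just illustrative choices):

```glsl
#version 430
layout(local_size_x = 64) in;

shared float partial[64];

layout(std430, binding = 0) buffer Data { float values[]; };

void main() {
    uint i = gl_LocalInvocationID.x;
    partial[i] = values[gl_GlobalInvocationID.x];

    // Make the shared-memory writes visible, then wait until every
    // invocation in the work group has reached this point, before any
    // invocation reads its neighbours' slots.
    memoryBarrierShared();
    barrier();

    if (i == 0u) {
        float sum = 0.0;
        for (uint k = 0u; k < 64u; ++k) sum += partial[k];
        values[gl_WorkGroupID.x] = sum;
    }
}
```

Without the barrier pair, invocation 0 may read `partial[k]` slots that other invocations have not written yet, precisely because the hardware promises nothing about ordering on its own.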
Now, when does it matter in which order things are processed?
As a rule of thumb, it usually doesn't matter as long as you stick with the more "traditional" render pipeline:
- It doesn't matter whether you process vertex 5, 34, or 732 first; they are not dependent on each other. You wouldn't notice a difference, and you don't care.
- It matters that all vertices of a primitive (such as a triangle) have been processed before the geometry shader is invoked. The implementation ensures this is the case, simply by processing all the vertices, and then invoking the geometry shader (you need not care).
- It matters that all vertex/geometry/tessellation work (belonging to one draw call) is done before the fragment shader runs. Again, this is trivially ensured by how the pipeline works.
Triangles are rasterized to fragments by some method unknown to you, and those fragments are then processed in parallel in groups of 2x2 or larger (this is necessary for partial derivatives / mip level calculation). Some fragments may be shaded although they are not part of a triangle at all (so-called helper invocations); they will be discarded but are still shaded. Some may not pass a test (depth, stencil, whatever) and be discarded. Some fragments may be shaded twice (think of fragments on the diagonal of a fullscreen quad, which is really just two triangles from the point of view of the hardware). Some will be weighted using some known, unknown, or tunable function (think multisampling).
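The 2x2 quads are observable in GLSL through the derivative functions: `dFdx`/`dFdy` compare a value between neighbouring fragments of the same quad, and the implicit mip selection in `texture()` is built on exactly those derivatives. A minimal fragment shader illustrating this (the `uv`/`tex` names are just placeholders):

```glsl
#version 430
in vec2 uv;
uniform sampler2D tex;
out vec4 color;

void main() {
    // These compare uv against the neighbouring fragments in the same
    // 2x2 quad -- which is why helper invocations outside the triangle
    // still run: their values are needed to form the derivatives.
    vec2 duvdx = dFdx(uv);
    vec2 duvdy = dFdy(uv);

    // Equivalent to texture(tex, uv), but with the derivatives (and
    // hence the mip level selection) spelled out explicitly.
    color = textureGrad(tex, uv, duvdx, duvdy);
}
```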
Usually, rather than just 2x2, something like 64 or so fragments will be processed in parallel in a shader core running the same identical instructions at the same time (with several thousand queued, swapped in and out on demand to cover for texture/memory latency), and a few dozen or hundred execution units will run independently of each other.
Whatever! Not your problem! It is guaranteed (by the API contract, so it is ultimately the driver's problem) that what comes out is the same as if everything happened exactly in the order you specified. This is still relatively easy for the implementation to guarantee: while you are allowed to read pretty much everything, you can only ever write to a single, exactly specified location (in other words, you have gather functionality, but not scatter). So all the implementation really needs to do is not mess up its own order of rasterization and blending.
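This gather-but-not-scatter rule is what an ordinary fragment shader looks like; for example, a simple box blur (the `scene` sampler name is just illustrative) reads from many locations but writes to exactly one:

```glsl
#version 430
uniform sampler2D scene;
out vec4 color;  // the ONE location this fragment may write to

void main() {
    ivec2 p = ivec2(gl_FragCoord.xy);

    // Gather: read from as many locations as you like...
    vec4 sum = vec4(0.0);
    for (int dx = -2; dx <= 2; ++dx)
        for (int dy = -2; dy <= 2; ++dy)
            sum += texelFetch(scene, p + ivec2(dx, dy), 0);

    // ...but write only this fragment's own output. Where that lands
    // is under the driver's control, so ordering stays its problem.
    color = sum / 25.0;
}
```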
So much for the easy part. Now there are atomic counters and shader load/store, which allow you to do... scatter: write to more or less arbitrary locations, concurrently. This is where it gets ugly.
If you use shader load/store, you must take extra care. Writing to arbitrary variables or memory locations without knowing which of your fragments will be shaded first can, and will, lead to surprising results. It makes no difference whether fragment 43772 is shaded before fragment 43775 as long as each one only ever writes to its own output, which is under the driver's control. But it matters a lot when they both write a value to memory location 123456, or when they both modify a counter, and this happens in a different order than you expected.
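To make the hazard concrete, here is a sketch of a fragment shader using image load/store and an atomic counter (the `headPointers` image and `nextSlot` counter are hypothetical names, loosely in the style of order-independent-transparency linked-list techniques):

```glsl
#version 430
layout(r32ui, binding = 0) uniform uimage2D headPointers;
layout(binding = 0, offset = 0) uniform atomic_uint nextSlot;

out vec4 color;

void main() {
    // Every fragment covering this pixel writes to the SAME texel.
    // Without atomics or explicit synchronization, which value
    // survives is undefined -- it depends on a rasterization/shading
    // order you can neither observe nor control.
    imageStore(headPointers, ivec2(0, 0), uvec4(uint(gl_FragCoord.x)));

    // Each individual increment is race-free, so every fragment gets
    // a unique slot -- but WHICH fragment gets which slot is still in
    // no defined order.
    uint slot = atomicCounterIncrement(nextSlot);

    color = vec4(float(slot));
}
```

The atomic counter fixes the lost-update problem, but not the ordering problem; if your algorithm depends on fragments obtaining slots in primitive order, you need the heavier tools from above (memoryBarrier, glTextureBarrier, fences) or an order-independent formulation.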