Yes, Increment/DecrementCounter and Append/Consume all operate on the same hidden counter. Append/Consume are just a bit of syntactic sugar that wraps usage of that counter. You can implement the same functionality yourself by calling Increment/DecrementCounter on an RWStructuredBuffer:
uint outputIdx = OutputBuffer.IncrementCounter();
OutputBuffer[outputIdx] = outputData;
If you look at the generated bytecode for a shader that calls Append, you'll see that it does the equivalent of the above code. Let's try it on this simple compute shader:
AppendStructuredBuffer<uint> OutputBuffer;
[numthreads(64, 1, 1)]
void CSMain(in uint3 ThreadID : SV_GroupThreadID)
{
    OutputBuffer.Append(ThreadID.x);
}
Here's the resulting bytecode generated by FXC:
cs_5_0
dcl_globalFlags refactoringAllowed
dcl_uav_structured u0, 4
dcl_input vThreadIDInGroup.x
dcl_temps 1
dcl_thread_group 64, 1, 1
imm_atomic_alloc r0.x, u0
store_structured u0.x, r0.x, l(0), vThreadIDInGroup.x
ret
The imm_atomic_alloc performs the atomic add on the hidden counter, and the value of the counter prior to that add is used as the index for writing into the structured buffer.
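To make the counter semantics concrete, here is a rough CPU-side analogy in C++. It is only a sketch: `AppendBufferModel` and `runAppendThreads` are hypothetical names invented for illustration, and `std::atomic` stands in for whatever the hardware actually does. The point is that `fetch_add` returns the value from before the add, just as described for imm_atomic_alloc, while the consume side decrements first and reads at the new value, which is how DecrementCounter behaves.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical CPU-side model of an append/consume buffer:
// plain storage plus a "hidden" atomic counter.
struct AppendBufferModel {
    std::vector<uint32_t> data;
    std::atomic<uint32_t> counter{0};

    explicit AppendBufferModel(std::size_t capacity) : data(capacity, 0) {}

    // Models imm_atomic_alloc + store_structured: the counter value
    // from BEFORE the add becomes the index that gets written.
    void append(uint32_t value) {
        uint32_t index = counter.fetch_add(1, std::memory_order_relaxed);
        assert(index < data.size());  // bounds check for this model only
        data[index] = value;
    }

    // Models the consume side: decrement first, then read at the
    // post-decrement value.
    uint32_t consume() {
        uint32_t index = counter.fetch_sub(1, std::memory_order_relaxed) - 1;
        assert(index < data.size());  // also catches underflow on an empty buffer
        return data[index];
    }
};

// Launches threadCount threads that each append their own ID once,
// then returns the final counter value.
uint32_t runAppendThreads(AppendBufferModel& buffer, uint32_t threadCount) {
    std::vector<std::thread> threads;
    for (uint32_t id = 0; id < threadCount; ++id) {
        threads.emplace_back([&buffer, id] { buffer.append(id); });
    }
    for (auto& t : threads) {
        t.join();
    }
    return buffer.counter.load();
}
```

Calling `runAppendThreads(buffer, 64)` on a 64-element model leaves the counter at 64 with every thread ID stored exactly once, but in whatever order the threads happened to win the atomic add. That is the same "unordered but collision-free" result you get from Append in the shader.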
Different GPU architectures have different ways of implementing this functionality. The most straightforward way to do it would be for the GPU to support a global atomic add on a memory location. Then the "hidden counter" is just a 4-byte value somewhere in GPU-accessible memory. On the GPU architectures that I'm familiar with (AMD's GCN family), this sort of global atomic operation is supported and could be used to implement the counter. However, these GPUs actually have a bit of on-chip memory called "GDS" that the driver uses for this purpose. Using GDS allows the operation to be kept entirely on-chip without having to go out to caches or external memory.
On AMD GPUs, the way these operations work would allow the counter to be accessed concurrently by two different shaders with no issues. I would assume it would also work on other GPUs, but I'm not too familiar with Intel chips, and Nvidia is notoriously tight-lipped about the specifics of their hardware. Either way, I'm not sure exactly what guarantees are made by the D3D12 API. D3D11 didn't even have the notion of multiple shaders executing in parallel, and essentially forced sync points between sequential dispatches (as if you always used a UAV barrier). Since most of the shader-centric documentation is from the D3D11 era, it doesn't have any information about accessing counters across multiple dispatches. The docs for D3D12_RESOURCE_UAV_BARRIER do say this:
You don't need to insert a UAV barrier between 2 draw or dispatch calls that only read a UAV. Additionally, you don't need to insert a UAV barrier between 2 draw or dispatch calls that write to the same UAV if you know that it's safe to execute the UAV accesses in any order.
I would think that the "safe to execute the UAV accesses in any order" clause would apply in the case of an atomic append, since you're expecting the results to be unordered. But perhaps someone more familiar with the specs or documentation could clear that up further. Either way, I would be surprised if what you described didn't work in practice.