My assumption is that all atomic writes on a UAV are immediately visible to all other threads, since within a single dispatch call threads in different threadgroups can see the atomic result immediately (or is this not true?)
Even if such assumptions are true for current hardware, no one knows what changes in the future.
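To illustrate why relying on "it works today" is risky, here is a minimal CPU-side analogy in C++. It is not GPU or D3D code, and the names `publish_and_consume`, `payload` and `flag` are made up for this sketch. The point it tries to show: an atomic being visible does not by itself say anything about the plain writes made before it, unless an explicit ordering (the analogue of a barrier) is requested.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// CPU-side analogy only (hypothetical names): `payload` stands in for a
// plain, non-atomic buffer write and `flag` for an atomic counter.
int publish_and_consume() {
    int payload = 0;            // ordinary, non-atomic data
    std::atomic<int> flag{0};   // atomic "data is ready" marker
    int seen = -1;

    std::thread producer([&] {
        payload = 42;  // plain write
        // Release store: everything written before it is published
        // together with the flag. This plays the role of the barrier.
        flag.store(1, std::memory_order_release);
    });

    std::thread consumer([&] {
        // Acquire load pairs with the release store above.
        while (flag.load(std::memory_order_acquire) == 0) {
            std::this_thread::yield();
        }
        // Safe to read: the release/acquire pair orders the plain write
        // before this read.
        seen = payload;
    });

    producer.join();
    consumer.join();
    return seen;
}
```

With `std::memory_order_relaxed` on both sides the flag would still become visible, but reading `payload` would be a data race, even if it happens to work on a given machine. That is the same kind of "works on current hardware" trap as skipping the barriers.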
Following my own advice, I've just added all the missing barriers to my project.
Before the change the runtime averaged over 10 frames was 2.60ms, after the change it's still 2.58ms.
So there seems to be no price to pay for using proper barriers :)
Oops - I forgot that I was already reading a timestamp at all the places where the new barriers have been added.
Reading timestamps has a noticeable performance cost, and probably the same goes for the barriers, so the timestamps may have hidden their cost - I'll have a look...
EDIT:
After removing the fine-grained timestamp reads I get 2.432ms.
There are only 2 barriers left that I could remove while keeping things working; without them I get down to 2.418ms.
So memory barriers have a cost (about 0.014ms, or roughly 0.6%, for those two), but it's not that bad.