In mine at the moment, if I fail to pop a job, I try to decrement a semaphore (which blocks while the count is zero). Every time I push a job, I increment the semaphore.
When the queue becomes empty, yes, it does spin for a while until the semaphore reaches zero, then sleeps until the next push.
There's probably a better way to implement that using events/condition-variables...
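For reference, here's a minimal C++20 sketch of that pattern, in the slightly simplified variant where the consumer always decrements the semaphore before popping (the `Job` and `JobQueue` names are just for illustration, not from the post). The semaphore counts queued jobs, so `pop()` blocks exactly when the queue is empty:

```cpp
// Minimal sketch: a locked queue paired with a counting semaphore.
#include <deque>
#include <functional>
#include <mutex>
#include <semaphore>

using Job = std::function<void()>;

class JobQueue {
public:
    void push(Job job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push_back(std::move(job));
        }
        items_.release();  // every push increments the semaphore
    }

    Job pop() {
        items_.acquire();  // blocks while the count is zero, i.e. queue empty
        std::lock_guard<std::mutex> lock(mutex_);
        Job job = std::move(jobs_.front());
        jobs_.pop_front();
        return job;
    }

private:
    std::mutex mutex_;
    std::deque<Job> jobs_;
    std::counting_semaphore<> items_{0};  // counts jobs currently queued
};
```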
This. You can even do it with your own "semaphore-like" thingie (I wouldn't know what to call it... event count, maybe?), which is basically two integers, one counting produce events and one counting consume events. You take each counter modulo the array's size as the index into the ring. The advantage is that it's easier to implement than a "typical" lockfree queue, and it can work without compare-exchange (on the producer end, in any case).
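A rough sketch of that idea, under my own assumptions (single producer, single consumer, power-of-two capacity so the modulo is a cheap mask; the names are made up). Note the producer side really is just plain loads and stores on its counter, no compare-exchange anywhere:

```cpp
// Two-counter ("event count") ring: one counter per produce/consume side.
#include <atomic>
#include <cstddef>
#include <optional>

template <typename T, std::size_t Capacity>
class SpscRing {
    static_assert((Capacity & (Capacity - 1)) == 0, "Capacity must be a power of two");
public:
    bool try_push(const T& value) {
        std::size_t produced = produced_.load(std::memory_order_relaxed);
        std::size_t consumed = consumed_.load(std::memory_order_acquire);
        if (produced - consumed == Capacity) return false;   // ring is full
        slots_[produced % Capacity] = value;
        produced_.store(produced + 1, std::memory_order_release);  // plain store, no CAS
        return true;
    }

    std::optional<T> try_pop() {
        std::size_t consumed = consumed_.load(std::memory_order_relaxed);
        std::size_t produced = produced_.load(std::memory_order_acquire);
        if (produced == consumed) return std::nullopt;       // ring is empty
        T value = slots_[consumed % Capacity];
        consumed_.store(consumed + 1, std::memory_order_release);
        return value;
    }

private:
    T slots_[Capacity];
    std::atomic<std::size_t> produced_{0};  // count of produce events
    std::atomic<std::size_t> consumed_{0};  // count of consume events
};
```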
On the other hand, I've seen zero difference in practice between all that awesome lockfree stuff and a simple queue with a lock or a semaphore (even abusing a Windows I/O completion port as a queue yields surprisingly good results with practically zero implementation work; I've done that). In retrospect, I would personally never again waste the time it takes to implement a lockfree task queue (it's much harder to get error-free than it looks at first glance, and it's doubtful whether you really gain anything).
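For anyone curious about the IOCP trick, a minimal Win32 sketch (error handling omitted; `push_job`/`pop_job` are hypothetical names I'm using for illustration). The port itself is the thread-safe, blocking queue; no actual overlapped I/O is involved, the job pointer is just smuggled through the completion key:

```cpp
#include <windows.h>

// A port not bound to any file handle; the last argument is the number of
// threads allowed to run concurrently (0 = one per CPU).
HANDLE port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);

void push_job(void* job) {
    // Bytes-transferred and the OVERLAPPED pointer are unused; the job
    // pointer travels as the completion key.
    PostQueuedCompletionStatus(port, 0, (ULONG_PTR)job, NULL);
}

void* pop_job() {
    DWORD bytes;
    ULONG_PTR key;
    LPOVERLAPPED overlapped;
    // Blocks until some thread posts a job to the port.
    GetQueuedCompletionStatus(port, &bytes, &key, &overlapped, INFINITE);
    return (void*)key;
}
```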
Yes, lockfree stuff matters if you push 250,000 tasks per frame. But well, you don't do that. You do want to partition work, of course, but you also want cache coherency and you want to avoid bus contention and context switches, so you won't push work items that are too small. Which means you won't push too many.
If you push 5,000 or so tasks per frame, the measurable gains are exactly zero (at least that's what it looks like to me... please correct me if you have measured something different), and the overhead of a typical fast userspace lock (with kernel backoff) is very much acceptable. The implementation, on the other hand, is so much easier and less error-prone. Yes, it's not cool, but it works.