I thought the key system 'thread' overhead was the locks and such of making a system level call
You don't need locks just because you have threads; it depends on your data structures.
Similarly, you don't avoid the need for locks when you use fibers, unless you apply very strict rules to when you may call fiber-switching functions. Which means things like, "don't do I/O in fibers," which ends up being too restrictive for most cases.
The benefit of fibers is that re-scheduling to another thread/fiber is cheaper in userspace than in kernel space. That's it. If you are drowning in context switches, then fibers may help. But if that's the case, you're probably threading your application too finely anyway.
Mechanisms (data) for 'fibers' dont have to be on the stack
You are in a fiber. You make a function call. Where does the state of the current function get saved? Where do locals for the new function you call get allocated?
Separately, that other function you call, may call a yielding function. Where does the OS save the locals of each function on the call stack while it switches to another fiber?
Of course fibers need stack space. You may allocate that from the heap or other unallocated memory regions for each fiber, just like the OS allocates stack space for threads from unallocated memory space.