Also, you'll probably want to invert your for loop order and do z, y, x; it's more memory friendly. If you're wondering why, imagine you had a single array of length 16*16*16, and think about how you jump around in it as you travel through your innermost for loops.
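To make the "single array" picture concrete, here is a rough sketch in C. It assumes a layout where x is the fastest-varying index, which may or may not match OP's code; `flat_index`, `sum_zyx` and `sum_xyz` are made-up names for illustration only.

```c
#include <stddef.h>

#define N 16  /* chunk edge length, from the 16*16*16 example */

/* Assumed layout: x is the least-significant (fastest-varying) index. */
static size_t flat_index(size_t x, size_t y, size_t z) {
    return x + N * (y + N * z);
}

/* z outermost, x innermost: each inner step moves to the adjacent element. */
static long sum_zyx(const int *cells) {
    long total = 0;
    for (size_t z = 0; z < N; z++)
        for (size_t y = 0; y < N; y++)
            for (size_t x = 0; x < N; x++)
                total += cells[flat_index(x, y, z)];
    return total;
}

/* x outermost, z innermost: each inner step jumps N*N = 256 elements. */
static long sum_xyz(const int *cells) {
    long total = 0;
    for (size_t x = 0; x < N; x++)
        for (size_t y = 0; y < N; y++)
            for (size_t z = 0; z < N; z++)
                total += cells[flat_index(x, y, z)];
    return total;
}
```

Both functions compute the same sum; only the order in which memory is touched differs. Under this assumed layout, the z, y, x order walks the array front to back, while the x, y, z order strides through it 256 elements at a time.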
This is always a good place to look, but OP's code seems consistent in that X is both the outermost loop and the most-significant array index. In other words, the X and Z "labels" are swapped consistently, and they don't seem to be mismatched in the way that usually thrashes the cache. The speedup was likely due entirely to getting rid of sqrt(), unless I'm not reading the code as well as I think I am.
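A small sketch of both points, in C. It assumes an x-major layout (x most significant, z least) and assumes the sqrt() was part of a distance-versus-radius test; neither is confirmed here, and `flat_index_xmajor`, `xyz_loops_are_sequential` and `inside_sphere` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

#define N 16  /* chunk edge length, from the 16*16*16 example */

/* Assumed layout: x is the most-significant index, z the least. */
static size_t flat_index_xmajor(size_t x, size_t y, size_t z) {
    return (x * N + y) * N + z;
}

/* Returns true if looping x, then y, then z visits indices 0, 1, 2, ... in order. */
static bool xyz_loops_are_sequential(void) {
    size_t expected = 0;
    for (size_t x = 0; x < N; x++)
        for (size_t y = 0; y < N; y++)
            for (size_t z = 0; z < N; z++)
                if (flat_index_xmajor(x, y, z) != expected++)
                    return false;
    return expected == (size_t)N * N * N;
}

/* Distance test without sqrt(): compare squared distance to squared radius.
   Equivalent to sqrt(d2) <= radius as long as radius is non-negative. */
static bool inside_sphere(int dx, int dy, int dz, int radius) {
    return dx * dx + dy * dy + dz * dz <= radius * radius;
}
```

If the layout really is x-major, the x, y, z loop order is already the cache-friendly one, so reordering the loops would not be expected to help. The squared comparison drops one sqrt() call per cell tested.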