As alvaro said, if you comment out that "culprit line", the optimizer will remove everything else as well, so the whole inner loop is the actual culprit.
There you have 375k 'floor' ops, done by converting to int, and that int is then multiplied by a float. That is not a good thing. Try replacing the conversion with 'floor' and see if it makes any difference. Floor isn't free in general, though.
Try enabling SSE2 if you are compiling for 32-bit, as faster int/float conversion ops are used then. I'm not sure how many of those the compiler is allowed to optimize away. Try to not do arithmetic with both integers and floats together.
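To make the cast-versus-floor suggestion concrete, here is a rough sketch. The helper names and the cellSize parameter are made up for illustration and are not taken from your code. One thing to watch: an int cast truncates toward zero while std::floor rounds down, so the two only agree for non-negative inputs.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: cell coordinate via an int cast.
// The cast truncates toward zero and needs a float -> int conversion.
inline int cellByCast(float position, float cellSize) {
    assert(cellSize > 0.0f);
    return static_cast<int>(position / cellSize);
}

// Hypothetical helper: cell coordinate via std::floor.
// The result stays a float, so later float arithmetic on it
// does not need an int -> float conversion.
inline float cellByFloor(float position, float cellSize) {
    assert(cellSize > 0.0f);
    return std::floor(position / cellSize);
}
```

If your positions can go negative, swapping one for the other changes which cell you land in, so check that before comparing timings.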
In addition you have 375k int * float operations from 'x/y/z * stride'.
Then you have like a million or two adds and subtracts, same for multiplies, and 250k branches.
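One way to get the int * float work out of the inner loop is to build the per-axis positions once, up front, and only read floats inside the loops. This is just a sketch under my own assumptions: axisPositions and sumGrid are hypothetical names, and the x + y + z sum is a stand-in for whatever your real per-cell work is.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: origin + i * stride for i in [0, count).
// The int -> float conversion and multiply happen count times per axis
// instead of once per grid cell.
inline std::vector<float> axisPositions(float origin, float stride, std::size_t count) {
    assert(stride > 0.0f);
    std::vector<float> positions(count);
    for (std::size_t i = 0; i < count; ++i) {
        positions[i] = origin + static_cast<float>(i) * stride;
    }
    return positions;
}

// Hypothetical stand-in for the per-cell work: sums x + y + z over the grid.
// The inner loop only touches floats that were computed beforehand.
inline float sumGrid(const std::vector<float>& xs,
                     const std::vector<float>& ys,
                     const std::vector<float>& zs) {
    float total = 0.0f;
    for (float z : zs) {
        for (float y : ys) {
            const float yz = y + z;  // hoisted out of the innermost loop
            for (float x : xs) {
                total += x + yz;
            }
        }
    }
    return total;
}
```

This doesn't change the O(N^3) cell count, it only makes each cell cheaper, so measure before and after to see whether it matters in your case.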
Interesting, I hadn't realized that was such an expensive operation, but that makes sense. I'll try changing it so ints and floats aren't used together. Yeah, this whole function seems very expensive. Any other ideas on how to remove the O(N^3) complexity? I guess maybe that's just the nature of SPH fluid surface generation. I've tried to think of ways around that but am stumped. Currently it looks like the compiler is set to use SSE2 (I'm using MSVC with the /arch:SSE2 compiler switch), and the disassembly shows use of the xmm0 register.