- Your OpenCL kernel actually runs on the CPU (you didn't say which implementation you're using).
- Your OpenCL kernel runs on a GPU, but the runtime is absolutely dominated by PCIe transfer latency, not execution speed.
Also, launching a kernel and synchronizing for the result isn't completely "free" either.
Try again with a kernel that does much more work per element, so that the fixed transfer and launch costs are amortized, and you'll likely see a much bigger difference (50-100 times).
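To make the amortization argument concrete, here is a rough back-of-the-envelope cost model in Python. It is only a sketch: every number in it (CPU and GPU throughput, PCIe bandwidth, launch overhead, bytes per element) is a made-up assumption for illustration, not a measurement of any real hardware, and the function name `gpu_speedup` is hypothetical.

```python
def gpu_speedup(n_elements, flops_per_element,
                cpu_flops=10e9,            # assumed CPU throughput (FLOP/s)
                gpu_flops=1000e9,          # assumed GPU throughput (FLOP/s)
                pcie_bytes_per_s=8e9,      # assumed effective PCIe bandwidth
                launch_overhead_s=50e-6,   # assumed launch + sync cost
                bytes_per_element=8):      # assumed: 4 bytes in, 4 bytes out
    """Estimate CPU time / GPU time under a very simple cost model."""
    work = n_elements * flops_per_element
    cpu_time = work / cpu_flops
    # The GPU pays fixed costs that do not depend on kernel complexity.
    transfer_time = n_elements * bytes_per_element / pcie_bytes_per_s
    gpu_time = launch_overhead_s + transfer_time + work / gpu_flops
    return cpu_time / gpu_time


if __name__ == "__main__":
    n = 1_000_000
    for flops in (1, 10, 100, 1_000, 10_000):
        print(f"{flops:>6} FLOP/element -> speedup {gpu_speedup(n, flops):.2f}x")
```

Under these assumed numbers, a kernel doing about one operation per element comes out slower on the GPU than on the CPU, because the transfer time dwarfs the compute time. The estimated speedup only approaches the raw throughput ratio once each element carries thousands of operations.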