As far as I know, calling Present is essentially the same as flushing the GPU's command buffers. So any profiling you do before the Present call only tells you how long the CPU spent issuing the commands, not how long the GPU takes to execute them. The CPU doesn't wait for the GPU to finish those commands until you call Present (or do something else that forces the CPU to sync with the GPU).
To clarify a bit further... essentially the driver won't let the CPU get more than N frames ahead of the GPU. So if the CPU is just spitting out lots of commands and the GPU is taking a long time to execute them, the driver will start to block the CPU in Present once N frames have been buffered. This magic "N" can be queried and set with IDXGIDevice1::GetMaximumFrameLatency and IDXGIDevice1::SetMaximumFrameLatency (it defaults to 3 frames).
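To make the throttling concrete, here is a rough toy model in plain C++ (no D3D involved). It assumes each frame costs a fixed cpu_ms to submit and gpu_ms to execute, and that Present blocks until the frame submitted max_latency frames earlier has finished on the GPU. The function name and parameters are made up for illustration, and real drivers are more complicated than this.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy model of frame-latency throttling. Returns, per frame, how many
// milliseconds the CPU sits blocked inside "Present".
// All names here are hypothetical; this is not how any driver is written.
std::vector<int> simulate_present_waits(int frames, int cpu_ms, int gpu_ms,
                                        int max_latency) {
    assert(max_latency >= 1);
    std::vector<int> gpu_finish(frames, 0);  // time the GPU finishes frame i
    std::vector<int> waits(frames, 0);       // CPU stall in Present for frame i
    int cpu_time = 0;

    for (int i = 0; i < frames; ++i) {
        // CPU records and submits the commands for frame i.
        cpu_time += cpu_ms;

        // "Present": once max_latency frames are queued, block until the
        // oldest queued frame has been completed by the GPU.
        if (i >= max_latency) {
            int wait = std::max(0, gpu_finish[i - max_latency] - cpu_time);
            waits[i] = wait;
            cpu_time += wait;
        }

        // The GPU starts frame i when it is both submitted and the previous
        // frame is done.
        int prev_done = (i > 0) ? gpu_finish[i - 1] : 0;
        gpu_finish[i] = std::max(cpu_time, prev_done) + gpu_ms;
    }
    return waits;
}
```

With cpu_ms = 1, gpu_ms = 10 and max_latency = 3, the first three frames return from Present immediately and every later one stalls. That is why CPU-side timings look tiny at first and then suddenly show all the time in Present.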
So yeah, you definitely want to dig into GPU profiling. You can do some limited GPU profiling yourself using timestamp queries, but you have to be careful with them because they're often not accurate (they only record when the GPU reaches the timestamp in the command stream, not when it has completely finished executing the Draw or Dispatch calls bracketed by the timestamps). Ideally you'll want to use a vendor-specific tool like Nsight or GPU PerfStudio.
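If you do go the timestamp route, the arithmetic looks roughly like the sketch below. It only covers turning two tick values into milliseconds. The DisjointData struct is a stand-in I made up that mirrors the Frequency and Disjoint fields of D3D11_QUERY_DATA_TIMESTAMP_DISJOINT, and gpu_elapsed_ms is a hypothetical helper, not part of D3D. Creating the queries and reading them back with GetData is left out.

```cpp
#include <cstdint>

// Stand-in for the two fields of D3D11_QUERY_DATA_TIMESTAMP_DISJOINT
// (Frequency in ticks per second, Disjoint flag). Hypothetical type so the
// sketch compiles without the Windows SDK.
struct DisjointData {
    std::uint64_t Frequency;
    bool Disjoint;
};

// Converts two GPU timestamps into milliseconds.
// In a real app, begin_ticks and end_ticks would come from two
// D3D11_QUERY_TIMESTAMP queries (End() issued before and after the work),
// wrapped in Begin()/End() of a D3D11_QUERY_TIMESTAMP_DISJOINT query.
// Returns false when the sample should be thrown away.
bool gpu_elapsed_ms(std::uint64_t begin_ticks, std::uint64_t end_ticks,
                    const DisjointData& d, double& out_ms) {
    // Disjoint means the counter was unreliable for this interval
    // (for example, the clock frequency changed), so discard the result.
    if (d.Disjoint || d.Frequency == 0 || end_ticks < begin_ticks) {
        return false;
    }
    out_ms = static_cast<double>(end_ticks - begin_ticks) * 1000.0 /
             static_cast<double>(d.Frequency);
    return true;
}
```

Remember that the query results arrive a few frames late, so poll GetData on later frames rather than spinning on it right away, otherwise you introduce exactly the CPU/GPU sync you're trying to measure around.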