I wouldn't do this via a gradual advancement of the state, instead I'd write a pure function that takes the two input cameras (initial/target) and a fraction, and blends them to spit out a new camera setup each frame.

When initializing this "transition camera", you record the start time, then calculate the end time for the transition. To do this, look at each component to be interpolated (position, rotation, FOV, etc.) and its maximum rate (metres/s, degrees/s, etc.) to find the minimum time that each individual component would need, then take the maximum of all those times as the actual duration of the interpolation.

e.g. if you can move from A->B in 1s, but it will take 3s to rotate from looking at C to looking at D, then the movement should be slowed down so that it also takes 3s.

You then add this to your starting time to calculate the ending time.

Each frame, you then look at the current absolute time and the start/end times to calculate a fractional value of how far through the transition you should be (e.g. (now-start)/(end-start)). If this is 1.0 or greater, you're done with the transition camera and you can just start using the target camera. Otherwise you pass the two cameras and this fraction into your pure function to calculate a blended camera.
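That per-frame logic is tiny; a sketch (the camera objects and the blend function are placeholders for whatever your engine uses):

```python
def transition_fraction(now, start, end):
    """0.0 at the start of the transition, 1.0 at the end."""
    return (now - start) / (end - start)

def update_camera(now, start, end, initial_cam, target_cam, blend):
    t = transition_fraction(now, start, end)
    if t >= 1.0:
        return target_cam          # transition finished; use the target directly
    return blend(initial_cam, target_cam, t)   # pure blend function
```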

For scalar parameters like FOV, you'd just lerp them using the fraction. For rotations, you can slerp them using the fraction.
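If you're not using an engine-provided slerp, here is one possible implementation for unit quaternions stored as (w, x, y, z) tuples (the dot-product sign flip takes the shorter arc, and the near-parallel fallback avoids dividing by sin of a tiny angle):

```python
import math

def lerp(a, b, t):
    return a + (b - a) * t

def slerp(q0, q1, t):
    """Spherical interpolation between two unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:                # negate one input to take the short way round
        q1 = tuple(-c for c in q1)
        dot = -dot
    if dot > 0.9995:             # nearly parallel: lerp and renormalize
        out = tuple(lerp(a, b, t) for a, b in zip(q0, q1))
        n = math.sqrt(sum(c * c for c in out))
        return tuple(c / n for c in out)
    theta = math.acos(dot)       # angle between the quaternions
    s0 = math.sin((1.0 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return tuple(a * s0 + b * s1 for a, b in zip(q0, q1))
```

e.g. halfway between the identity and a 90-degree rotation about z, `slerp` returns the quaternion for a 45-degree rotation, as you'd expect.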

For position, you could also just lerp them using the fraction, but this would move in a straight line between the two. If you want the camera to move forwards in a curving arc, you could define a spline using the two positions, and the two forward vectors as the tangents at those positions.
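That spline is a cubic Hermite curve: two endpoints plus a tangent at each. A minimal sketch, assuming 3D positions as tuples and tangents already scaled to taste (a common choice is the forward vector times the distance between the two positions):

```python
def hermite(p0, p1, m0, m1, t):
    """Point on a cubic Hermite curve at t in [0, 1]:
    p0/p1 are the endpoint positions, m0/m1 the tangents at those points."""
    h00 = 2*t**3 - 3*t**2 + 1    # Hermite basis functions
    h10 = t**3 - 2*t**2 + t
    h01 = -2*t**3 + 3*t**2
    h11 = t**3 - t**2
    return tuple(h00*a + h10*c + h01*b + h11*d
                 for a, b, c, d in zip(p0, p1, m0, m1))
```

At t=0 this returns p0 exactly and at t=1 it returns p1, while the tangents bend the path in between, so the camera eases out along its initial forward vector and arrives along the target's.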