There are few things you can implement to handle large scenes:
1) View frustum culling (as mentioned by vinterberg). With this you avoid rendering anything outside of the view frustum as a simple way to keep draw calls down. You typically do this by having objects bounded by a simple shape (e.g. Axis Aligned Bounding Box or Bounding Sphere), which are then intersected with the camera frustum. To improve this, a hierarchy or grid is often used e.g. you first check if the bounding shape of an entire block of buildings is visible, if so, then check the individual bounding volumes within. Look into grids, nested grids, quadtrees and octrees as structures often used to accelerate this process.
2) Level-of-Detail (LOD) systems: Instead of drawing a full detail version of an object at all distances, you prepare simpler versions of the object which are then used when far enough away from the camera, so much so that the extra detail wouldn't really be visible. You can also stop drawing small or thin items at a distance too e.g. if you look all the way down a street, the rubbish bins needn't be rendered after a few hundred meters etc. For a very large scenes, people will sometimes use billboards or imposters (quads that face the camera essentially, very similar to particles) as the lowest level of detail. Generating the LOD models is often a manual process, but it can be automated in some cases.
3) Occlusion culling: Being inside a city, you can make use of the fact that buildings often obscure lots of the structures behind them. There are a number of techniques to do this:
i) you can break the level into parts and precompute which parts of a level are visible from each area (search for Precomputed Visibility Sets). This technique is fairly old fashioned as it requires quite a bit of precomputing and assumes a mostly static scene, but it can still be handy today in huge scenarios or on lower spec platforms.
ii) GPU occlusion queries - this is where you render the scene (ideally a much simpler version of it) and then use the GPU to determine which parts you actually need to render. Googling should provide you with lots of info, including the gotchas with this approach.
iii) CPU based occusion culling - this can be done by rasterizing on CPU a small version of the scene into a tiny zbuffer-like array, which is then used to do occlusion queries much like the GPU version. This avoids the latency of the GPU version at the expense of more CPU cost.
iv) A mix of both GPU and CPU approaches where you re-use the previous frame's zbuffer.
There are other methods but I think these are the most common.