Since the topic is so complex, I'll just put some very high-level notes while not even trying to formulate a complete answer to the actual question.
1) For cache-friendliness, try to ensure each operation's required memory is in a continuous memory region, and preferably accessed linearly from start to end. For example updating a particle system or an animated skeleton. Avoid "jumping" from heap object to another through pointers. A well-structured entity-component system is possibly a good match here, at least if the systems don't need accessing each other's data a lot.
2) A typical approach nowadays is to structure the execution of a frame into a task graph, and the tasks are executed on any available cores by worker threads. Data from each task or operation flows to the next, for example physics simulation & animation together produce the final positions of game objects, which go to culling, from which a visible object list goes to render command generation, which is finally submitted to the graphics API. When the culling / render stage for the current frame is running, you can already be calculating the logic for the next frame.
For both points 1 & 2 it's probably best to keep scripting for low-volume, high-level operations (e.g. ask pathfinding to start moving this object toward position x) or configuration (stats of game objects, list of render passes and postprocess effects for the actual rendering code). For such infrequent use even a singlethreaded script VM could be fine.
Here's Naughty Dog talking about their engine's multithreading & memory allocation: http://www.gdcvault.com/play/1022186/Parallelizing-the-Naughty-Dog-Engine
Also old Bitsquid development blog entries may be interesting reading: http://bitsquid.blogspot.com/