* Optimal struct size
On a 32bit CPU, are there advised sizes (32,64, 128, ...)? And, do these double on a 64bit CPU? When dealing with vectors and structs, the size often grows quickly.
Optimal struct size depends entirely on the nature of the memory access patterns and the entire memory system for the machine on which the algorithm is being executed. Your question here seems to imply that if you choose a certain size for the struct, you'll get good performance (or some other nice optimization by some other metric), when no such thing is guaranteed. What matters to the CPU is that you're always grabbing memory locations in an order that guarantees a high likelihood of touching something already in the cache. A small struct size can make that easier to achieve, but a small struct size can be worthless if you're grabbing array then you grab array.
* Filling records
Lets say an optimal size would be 64 bytes, but I only have 60. Is it wise to add some dummy bytes to fill (not concerned about memory usage)?
I guess this is what Delphi does by default btw, unless you declare a "packed record".
Let your compiler handle this. In fact, I would say don't worry about this at all unless you truly have some time critical piece of code. The only places I've seen that consistently had gains from something like this were acceleration structures for graphics algorithms where you could literally be doing millions of queries. I remember stressing over bits to try to put a Kd-tree node into a cache line for a particular system, and in the end, it only netted me 10%-15% performance improvement. Don't get me wrong, that kind of performance improvement can be huge, but it's one of those last resort optimizations and all you're doing is trying to reduce the size of the constant in your asymptotic analysis.
* Splitting in multiple records
In case 64bytes would be an optimal size, but I need at least 100 bytes... Would it be smart to divide the data in 2 records, each 64 bytes each?
Again, this depends entirely on your memory access pattern and the nature of the machine you're running on. The only correct answer here is: "Maybe". If we were to assume that the algorithm you're executing touches a particular set of data more often than another, you may find gains by splitting your algorithm and data into one that relies on a hot data and cold data split. You put things you commonly access into the hot data struct and things you access less frequently into the cold data. The goal with this sort of strategy is to pack more of the hot data into cache lines so you can take advantage of the caching system. The rationale for the performance gain hinges on the fact that you should be accessing hot data much more frequently so it is very costly if you cache miss on this memory access, however, if you access cold data, you're likely to cache miss, but since you don't access this data very often it shouldn't have a huge negative impact on performance. It's using Amdahl's law to arrange memory.
* Dunno how caching works exactly, but I guess there rules apply especially for looping through arrays structs. But, does it also apply when passing just a single struct to a function?
The memory hierarchy for modern computer systems are both simple and very complicated. They are simple in the high level concept, but very complicated in the nitty gritty details. I would highly recommend learning up on computer architecture to get a better idea of how the hardware works in general. You can find some great amount of detail of memory and caches in http://www.akkadia.org/drepper/cpumemory.pdf. But, to answer your question, caching always applies. Every single data access you do will be affected by the cache hierarchy on modern CPUs (since just about every single CPU now has a cache). You have your instruction cache, your data cache, and nowadays, multiple levels of data cache (sometimes shared between cores, sometimes not). The CPU works pretty much exclusively with registers and the cache, not with main memory.
Whether or not these things matter to you in your particular situation is another question. I'm of the opinion that performance always matters, but you only have so much time in your life to implement your algorithms. The real issue is which parts of your program have performance implications that actually matter to your end user. Again, Amdahl's law. Use it, use it everywhere.