1) The workload is hidden by the fact that the main thread isn't using the execution resources the hyper-thread is using. It's like carrying rocks from one pile to another: if you pick up a small rock in one hand, you can pick up another small rock in your free hand, because you're not using your full strength yet.
1) You would think Hyperthreading slows down the execution, because it has to load and unload data for each operation. Maybe not in certain test cases, though.
True and true.
An Intel engineer once told me that if the system is already saturating the memory bandwidth, Hyperthreading can't kick in. Once said out loud it seems obvious, but it's not something you'd realize quickly.
2) So each core can do a few things, and this is RAID, right?
RAID is for disks.
3) Each core can fetch instructions out of memory freely? And they can make changes to the memory freely?
Yes. They ask the memory controller, and it tries to honour requests in order of arrival. If two arrive at exactly the same time, a predefined order is used (it's like two guys reaching for the same tool at the same time, and the same guy always says to the other one "you go first").
Without any kind of synchronization, this "make changes to memory freely" behaviour can be problematic and chaotic. For example, core A reads the value "0", adds 1, then stores it. Core B does the same thing. If they both read the value at the same time, by the time they both finish they will each have written the result "1", while the expected result is "2".
This is what atomic operations are for. If an operation is atomic, then by the time another core tries to access the value, the first core has already loaded it, modified it, and stored the result back, as one indivisible step.
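Here's a minimal C++ sketch of that lost-update race, plus the atomic version that fixes it (the counter names and iteration count are just for illustration; the race on the plain counter is technically undefined behaviour, shown only to demonstrate the problem):

```cpp
#include <atomic>
#include <iostream>
#include <thread>

int plainCounter = 0;               // no guarantee: increments can be lost
std::atomic<int> atomicCounter{0};  // fetch_add is atomic: no lost updates

void work() {
    for (int i = 0; i < 100000; ++i) {
        ++plainCounter;  // non-atomic read-modify-write: the race from above
        atomicCounter.fetch_add(1, std::memory_order_relaxed);
    }
}

int main() {
    std::thread a(work), b(work);
    a.join();
    b.join();
    // plainCounter often ends up < 200000; atomicCounter is always 200000.
    std::cout << "plain: " << plainCounter
              << "  atomic: " << atomicCounter << '\n';
}
```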
Non-atomic operations don't have that guarantee. For example, writing to unaligned memory on x86 is not atomic. So if I write the float "1.0f" to an unaligned address from core A while core B is reading it, core B may see a mixture of "1.0f" and whatever value was there before. The value core B reads would be complete garbage (it could be 1.0f, a NaN, an infinity, any random number). The only way to fix that is to ensure all reads and writes are aligned.
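A small sketch of how such an unaligned address comes about, and how you could check alignment before trusting a cross-core read (the buffer layout is hypothetical):

```cpp
#include <cstdint>
#include <cstdio>

// Returns true if p sits on the given power-of-two boundary.
static bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

int main() {
    alignas(4) char buffer[8] = {};
    // Pretend a 1-byte header precedes the float: the float now sits at
    // offset 1, which is not 4-byte aligned, so cross-core writes to it
    // are not guaranteed atomic on x86 and a reader may see a torn value.
    float* unaligned = reinterpret_cast<float*>(buffer + 1);
    std::printf("aligned? %s\n",
                is_aligned(unaligned, alignof(float)) ? "yes" : "no");
}
```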
But more complex operations (like updating multiple values) aren't atomic either, regardless of alignment; so if you don't use the synchronization primitives provided by the OS or the HW, it's going to fail sooner or later.
A non-atomic operation is like peeling a carrot while another guy grabs it and boils it, without even realizing his coworker hadn't finished peeling it.
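Since a multi-value update can't be made atomic by alignment alone, the usual fix is one of those synchronization primitives. A minimal sketch with std::mutex (the position struct is invented for the example):

```cpp
#include <mutex>
#include <thread>

struct Position { float x, y, z; };  // three values: can't be updated atomically

Position gPos{};
std::mutex gPosMutex;                // guards every access to gPos

void writer() {
    std::lock_guard<std::mutex> lock(gPosMutex);
    gPos = {1.0f, 2.0f, 3.0f};       // all three fields change as one unit
}

Position reader() {
    std::lock_guard<std::mutex> lock(gPosMutex);
    return gPos;                     // never sees a half-written position
}

int main() {
    std::thread t(writer);
    Position p = reader();           // the old value or the new one, never a mix
    t.join();
    (void)p;
}
```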
So do they all just run from the start of the program?
They start whenever the program requests another thread from the OS.
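In C++ that request looks like this: the new thread starts running the moment you create it, not at program startup (the function name is made up):

```cpp
#include <iostream>
#include <thread>

void backgroundWork() {
    std::cout << "started when the program asked, not at launch\n";
}

int main() {
    // ...the program runs single-threaded until here...
    std::thread worker(backgroundWork);  // the OS creates the thread now
    worker.join();                       // wait for it to finish
}
```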
And if you do two unrelated things they can multithread them?
Best multithreading scenario, since there is no shared data and no need to synchronize at all. It's like one guy doing laundry while the other one cooks. Or two guys cooking dinner in different kitchens.
I thought planning that kind of thing was up to the compiler, but I wouldn't know how it would do it even if it could.
Threading is so complex that compilers can't handle it automatically. There is active research on compilers threading trivial scenarios by themselves, but nothing concrete exists today.
What is popular are compiler threading extensions: with a few keywords (and by following lots of strict rules) you tell the compiler "this is threadable, run it in parallel"; the compiler checks that you're following most of the rules and tries to thread it. It doesn't work for very advanced stuff, though.
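OpenMP is one well-known example of this kind of extension: a single pragma tells the compiler the loop is threadable, and it's on you to make sure the iterations really are independent. A minimal sketch (compile with e.g. -fopenmp):

```cpp
#include <cstdio>

int main() {
    float data[1000];
    // The pragma promises the compiler that iterations don't depend on
    // each other, so it may split them across threads. Break that promise
    // and the result is garbage, exactly as described above.
    #pragma omp parallel for
    for (int i = 0; i < 1000; ++i) {
        data[i] = i * 0.5f;
    }
    std::printf("%f\n", data[999]);
}
```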
4) Ignoring virtual cores and single-core multithreading, how does one describe how a multi-core CPU executes code? Is message passing built in, or do they all just start running at the same time? I guess I don't really understand that level of computer architecture, but I'd like to get a feel for it.
They just run at the same time. But there is some wiring between the cores:
- Instructions with the lock prefix. They're guaranteed to be atomic. The CPU sends a signal to a common place, and this place marks the memory region being operated on as "locked" until the CPU sends the "finished" signal. If another signal arrives from another core for the same memory region, this place tells that CPU to stall, and later tells it to resume. This happens within a couple of cycles and is very fast, but it only works for simple operations like addition, subtraction, multiplication, etc. (see the sketch after this list).
- Ring 0 synchronization instructions. Only available in Ring 0 (kernel mode), these instructions tell a CPU to pause and resume operations (i.e. core A tells core B "wait until I say I'm finished!"). This is how the OS can implement mutexes and other synchronization mechanisms. You have to ask the OS, because normal applications don't have Ring 0 access. If they had it, viruses could hang your PC very easily (just tell all cores to pause indefinitely).
- Internal stuff. For example, the Core i7 has synchronization mechanisms in the L3 to reduce the performance hit of something called "false sharing". This is REALLY advanced stuff, so I won't go into detail about it.
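To make the first point concrete: on x86 a lock-prefixed read-modify-write looks like this in GCC-style inline assembly (a sketch for illustration only, x86 GCC/Clang specific; in practice you'd just use std::atomic, which compiles down to the same instruction):

```cpp
// Atomically adds v to *p and returns the previous value of *p.
// The "lock" prefix asserts the lock described in the first bullet for
// the duration of the xadd, so no other core can touch *p mid-operation.
int fetch_add_locked(int* p, int v) {
    asm volatile("lock; xaddl %0, %1"
                 : "+r"(v), "+m"(*p)
                 :
                 : "memory");
    return v;  // xadd leaves the old value of *p in v
}
```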
Cores execute the instructions in their cache, right? They can fetch instructions from memory into their cache? So how does the computer prevent two cores from executing the same code at the same time? Does each core start at a different instruction address or something?
Instructions are first loaded from RAM and fetched into the L1 cache. The CPU executes the code in the L1 cache.
First of all, it may be the case that the computer actually wants to execute the same code at the same time, so it just fetches the same RAM into each core's L1 cache. This is perfectly legal. They may want to work on different data, though. It's like running two instances of Media Player (same code) but playing different movies (different data). When you ask the OS to start a thread, you're allowed to pass an integer with it. You can store whatever you want in this integer, including a pointer to memory (i.e. "I want you to work on THIS data I'm sending you").
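A sketch of that "same code, different data" idea: both threads execute the same function, each handed its own data (the movie filenames are invented):

```cpp
#include <cstdio>
#include <thread>

// Same code for every thread; the parameter decides which data it works on.
void playMovie(const char* filename) {
    std::printf("decoding %s\n", filename);
}

int main() {
    std::thread a(playMovie, "movie_a.mp4");  // same instructions in both
    std::thread b(playMovie, "movie_b.mp4");  // cores' caches, different data
    a.join();
    b.join();
}
```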
As for your question "how does the computer prevent two cores from executing the same code at the same time": if you really want to prevent this, it was answered above. Use the OS functions (like mutexes) that map to Ring 0 instructions.
But 99% of the time, what you really want is to prevent two cores from working on the same DATA. Code != Data. Those two Media Player instances may be playing different movies, but they will share the same settings file. Only one can have read & write access to it, while the other one gets read-only access (or no access at all, falling back to factory defaults) until the first instance releases its write lock on the file.
Multithreading isn't for beginners. Threads are chaotic by nature, and many failure scenarios appear that transistors have to be dedicated to fixing. For example, if core A modifies memory address X, and this address was already loaded into all 4 cores' own L1 or L2 caches, those caches must be updated with the new value so the cores don't work with outdated data.
Thus, the memory controller must keep track of which memory regions are mapped into the caches, and check whether the address X about to be modified was loaded by other caches. But caches can't update a single value in isolation; they work in "lines" (chunks of adjacent memory; in L1, each line is typically 64 bytes), so they have to load the whole line again. If this happens too often, performance can be severely affected.
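This cache-line granularity is also what's behind the "false sharing" mentioned earlier: two cores hammering different variables that happen to share one line keep invalidating each other's copies. A common sketch of the workaround is padding each hot counter onto its own line (the 64 is an assumption about the line size):

```cpp
#include <thread>

// Without alignas, the two counters would likely share one 64-byte cache
// line, so each core's write would invalidate the other core's copy even
// though they never touch the same variable (false sharing).
struct alignas(64) PaddedCounter {
    long value = 0;
};

PaddedCounter counters[2];  // one cache line per counter: no false sharing

void bump(int which) {
    for (int i = 0; i < 1000000; ++i)
        ++counters[which].value;
}

int main() {
    std::thread a(bump, 0), b(bump, 1);
    a.join();
    b.join();
}
```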
Examples like this multiply, hence the complexity of multithreading in both HW and SW.
GPUs, for example, are just raw power and don't care about any of this "out of date" cache business. If you modified X and didn't flush the caches, that's your fault, and the caches will be stale. Maybe you don't need the value to be up to date, and working with older values is perfectly fine or at least tolerable. And of course, performance will be higher, because the HW isn't doing anything behind your back to ensure you don't mess up.