How does Multithreading Assembly work?

16 comments, last by Tribad 11 years, 3 months ago

I've been mulling this for a long time, but I can't picture the computer executing assembly instructions in parallel.

Does each core execute its own stream of instructions? Or what?

I'm not talking about implementation details, I'm curious about how multicore processing is implemented in the circuitry and such.

Go easy on the jargon though, I'm nowhere near up to date on this stuff. I'd get the theory if you stuck with metaphors or more common wording, though.


Yes, each core is a full CPU - each has its own set of registers (including the program counter). As a result, each core runs a separate thread in the system; it doesn't make sense for two cores to be running the same thread. A 4-core CPU is almost the same as having 4 separate CPU sockets, each with a 1-core processor, except that a 4-core processor on the same die will share various forms of cache and such.
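Here's a minimal sketch in standard C++ (no platform assumptions) of what "each core runs its own thread" looks like from the programmer's side: every std::thread gets its own register state, instruction pointer and stack, and the OS scheduler is free to run the two workers on different cores at the same time.

#include <cstdio>
#include <thread>

void worker(int id) {
    // each thread executes this with its own registers and stack
    std::printf("thread %d running\n", id);
}

int main() {
    std::thread a(worker, 1); // may be scheduled on one core
    std::thread b(worker, 2); // may be scheduled on another core
    a.join();
    b.join();
}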

Yep, a 4-core CPU is not much different from 4 single-core PCs running at the same time, except for very basic synchronization mechanisms (i.e. the 4 cores share the same RAM and bus).

For example, in a typical Intel Core 2 Quad each core has its own L1 cache, while the L2 cache is shared by each group of 2 cores.

In the typical Intel Core i7 I don't remember how the L2 is shared (or whether it is at all), but there is an additional L3 cache shared by all cores.

Do not confuse "multithreading using multiple cores" with "instruction-level parallelism" or with "multithreading using hyperthreading".

The first one is the one described above.

The second one has nothing to do with cores or multithreading and has been present in single-core CPUs since the original Pentium. Each CPU has more than one pipeline to execute multiple instructions at the same time. If you have, for example, "add eax, ecx" followed by "add edi, ecx" (i.e. a = a + b and c = c + b), both instructions are independent and the two pipelines will execute them at the same time. This may also be called "instruction pairing". Not all instruction combinations can be paired, because some CPU architectures don't clone the whole pipeline, just a part of it (usually for cheaper manufacturing or because of high power consumption).

Note that the CPU won't pair two dependent instructions. For example, "add eax, ecx" followed by "add edi, eax" can't be executed in parallel, because the second instruction depends on the result of the first one. Compiler optimizations rearrange instructions to interleave unrelated operations so that the chance of instruction pairing is maximized.
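The same idea can be seen from C++; this is only an illustration (the function names are made up, and what the hardware actually pairs depends on the compiler's output): the two statements in independent_adds have no data dependency, so a superscalar core is free to issue them together, while in dependent_adds the second statement has to wait for the first.

int independent_adds(int a, int b, int c) {
    int x = a + b; // roughly "add eax, ecx"
    int y = c + b; // roughly "add edi, ecx" - no dependency on x, can be paired
    return x + y;
}

int dependent_adds(int a, int b) {
    int x = a + b; // first add
    int y = x + b; // needs x, so it cannot execute in the same cycle as the line above
    return y;
}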

Also keep in mind that x86/x64 CPUs implement something called "Out of Order Execution" (OoOE), which is basically a component in the CPU that looks ahead at which instructions are going to be executed; if it finds one that is independent (i.e. one that doesn't depend on the result of eax), it executes that instruction at the same time as "add eax, ecx" and caches the result in temporary HW space until the instructions that come later are actually executed. OoOE may also kick in when one of the pipelines is stalled waiting for data to arrive from RAM. In other words, it does the same thing the compiler does (reorder operations to maximize instruction pairing and minimize stalls) but at the hardware level, and you have little to no control over it (except for instructions that issue a memory barrier).

When dealing with lock-less multithreaded (as in multi-core) code, OoOE can be a PITA if one doesn't make correct use of memory barriers (which prevent the OoOE unit from looking past the barrier until it is reached), because it may lead to race conditions.
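For the curious, this is roughly what a barrier looks like in portable C++ (a sketch, assuming C++11 atomics; the variable names are made up). The release store and the acquire load forbid both the compiler and the out-of-order hardware from moving the access to data past the flag, which plain loads and stores would not guarantee.

#include <atomic>
#include <cassert>
#include <thread>

int data = 0;
std::atomic<bool> ready{false};

void producer() {
    data = 42;                                    // plain write
    ready.store(true, std::memory_order_release); // barrier: 'data' becomes visible before the flag
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) { /* spin until the flag is set */ }
    assert(data == 42); // guaranteed by the acquire/release pairing
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}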

The third one, hyperthreading, was invented by Intel and is a fake version of the first one (multicore). When the CPU is only using one of the pipelines described under instruction-level parallelism and the OoOE unit couldn't do anything to prevent it, we've got a lot of idle pipelines.

Because those pipelines are "almost" an extra core, Hyperthreading kicks in and simulates an extra core to execute a parallel thread, but it is nowhere near the same level as a real additional core because: 1. not all parts of the pipeline are duplicated, and 2. the main thread may actually start using those unused pipelines, leaving no spare components to dedicate to the hyperthreaded thread.

Intel claims Hyperthreading yields an average gain of 15% in performance.

Btw, if you open Task Manager on a single-core system with hyperthreading, it will tell you there are two cores (the real one + the fake one). On an Intel Core i7 with 8 cores, it will tell you there are 16 cores (8 real ones + 8 fake ones).

Not all CPUs come with HT, in fact, most of them don't.
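As a side note, you can query that logical-processor count from C++ as well; this is a minimal sketch using only the standard library (no OS-specific APIs). On a hyperthreaded CPU the number includes the "fake" cores, e.g. 4 physical cores with HT typically report 8.

#include <cstdio>
#include <thread>

int main() {
    unsigned n = std::thread::hardware_concurrency(); // may return 0 if the count is unknown
    std::printf("logical processors reported: %u\n", n);
}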

1) You would think Hyperthreading slows down the execution, because it's having to load and unload data for each operation. Maybe not in certain test cases though.

2) So each core can do a few things, and this is RAID right?

3) Each core can fetch instructions out of memory freely? And they can make changes to the memory freely? So do they all just start running at the start of the program? And if you do two unrelated things they can multithread them? I thought that was up to the compiler to handle planning for that kind of thing, but I wouldn't know how it would if it could.



4) Ignoring virtual cores and single core multithreading, how does one describe how a multi core CPU executes code. Is message passing built in, or do they all just start running at the same time? I guess I don't really understand that level of computer architecture, but I'd like to get a feel for it.

Cores execute the instructions in their cache right? They can fetch instructions from memory to their cache? So how does the computer prevent two cores from executing the same code at the same time? Does each core start at a different instruction address or something??

I'm sorry I just don't quite understand who tells which core where to start and such.

1) The workload is masked by the fact that the main thread is not using the resources the hyper-thread is using. Like if you're carrying rocks from one pile to another and you pick up a small rock in one hand, you can pick up another small rock in your free hand because you're not using your full strength yet.

2) RAID stands for "Redundant Array of Independent Disks", so probably not.

3) Yes, all cores have access to the system bus and cache. I don't know what you mean by 'they run'. Cores don't run. They're physical objects on the CPU die. The OS handles core distribution, since it controls the thread scheduler.

4) The thread scheduler locks the core, sets its registers, including the flags and instruction pointer, and then unlocks it. The core will then execute instructions normally, ignoring the other cores unless it's given a special signal. When it's time to schedule a new thread the scheduler just locks the core again, captures the registers, sets them to the values of the next thread and unlocks.

As for the cache, the core is only vaguely aware of it. At the level of assembly you usually don't mess with the cache unless you have to. (Actually I don't recall reading about any instructions for messing with the instruction cache, but I haven't got too far into messing with the cache yet.) It's a separate unit that just streamlines the process of instruction fetching so that you don't have to wait on the bus as often.

Typically each core would start at a different address, yes, but there's no reason that more than one core can't be running the same instruction at the same time. The instruction stream is typically on a memory page with read-only access. The data that gets modified is either on the heap page (dynamic memory) or on the stack page (local variables and the function stack).

If you get comfortable with assembly and threading then you'll have an easier time with it.

void hurrrrrrrr() {__asm sub [ebp+4],5;}

There are ten kinds of people in this world: those who understand binary and those who don't.

1) Thanks, but unfortunately I still don't get it. The main thread isn't working?? Is this to do with fetching memory delays? Thanks for being patient with me.

2) I'm mixing up my Acronyms again.

3) By run I mean execute instructions on data. Does the OS tell the other cores what to execute? Does one core execute the OS? That doesn't seem right at all. The OS can run on multiple cores, so how does each core know which part of the OS to run? Does another core tell them? So does it all start with one core??

I'm trying to do exactly that, get comfortable with them. I'm just not getting how assembly can multithread. How does one instruction ensure it's executed only on one core? I don't mean two cores using the same instructions, I mean the same instruction on the same data at the same time. I don't mean memory management, I mean how does one core know it's supposed to do something different than another core?

4) I actually thought that the "registers" were a special portion of the "cache", but that doesn't really make sense anymore lol. I was thinking it's rather important to be able to tell a core to purge its cache for locking/unlocking functionality. You know, to ensure a fast-access copy is up to date with the rest of the memory.

Warning, a bit off-topic:

[quote name='rdragon1' timestamp='1356819533' post='5015548']
each core is a full CPU
[/quote]

No, not necessarily, often not. It depends on the vendor and what he defines as a core.

It's not uncommon to share some parts between cores. It's cheaper (less silicon = more money), can save energy and sometimes it doesn't even mean that the performance is reduced.

One example is the UltraSPARC T1: it had 8 cores but only a single floating-point unit. They were aiming at web servers and databases and didn't feel those workloads needed much floating-point computation. For their target market that was correct; on the other hand it meant the CPU was useless for HPC.

It's an old example and one of the more extreme ones, but even nowadays you can't expect 4/6/8/12/16/whatever cores to be the same as having the same number of complete processors.

On the other hand, having them (the cores) all on one die could mean that it's easier for the processor manufacturers to implement some kind of work stealing or similar things, which could offset the lack of a few processing units.

Just take a look at Wikipedia for more details. For example, the Bulldozer microarchitecture has something that sits between Hyperthreading and having two actual cores. And there are many more such interesting examples.

Back on topic:

[quote name='Blind Radish' timestamp='1356856439' post='5015690']
1) Thanks, but unfortunately I still don't get it. The main thread isn't working?? Is this to do with fetching memory delays? Thanks for being patient with me.
[/quote]

Maybe this will help: (https://en.wikipedia.org/wiki/Instruction_pipeline)

Imagine one thread can't fill the pipeline and the processing units (ALU, FPU) don't have enough work to do. Hyperthreading just makes the processing units do the work of another thread while they are waiting for new instructions.

[quote name='Blind Radish' timestamp='1356856439' post='5015690']
How does one instruction ensure it's executed only on one core? I don't mean two cores using the same instructions, I mean the same instruction on the same data at the same time.
[/quote]

You mean how does code get thread-safe? On x86/x64, by using the lock prefix, which makes the instruction atomic. I don't know exactly how it's implemented in the hardware.

You also have to distinguish between reads and writes: for example, it's no problem for multiple cores to read the same data. However, if the data is also written to, it gets complicated.
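In portable C++ the lock prefix is usually reached through std::atomic; this is a minimal sketch (the counter name is made up). On x86/x64 the fetch_add below is typically compiled to a lock-prefixed instruction such as "lock xadd".

#include <atomic>

std::atomic<int> hits{0};

void record_hit() {
    // an indivisible read-modify-write, even if two cores call this at once
    hits.fetch_add(1, std::memory_order_relaxed);
}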

[quote name='Blind Radish' timestamp='1356856439' post='5015690']
I mean how does one core know it's supposed to do something different than another core?
[/quote]

If everything is the same (same data, same instruction, same flags) it's expected to do the same. Or not?

1) You would think Hyperthreading slows down the execution, because it's having to load and unload data for each operation. Maybe not in certain test cases though.

1) The workload is masked by the fact that the main thread is not using the resources the hyper-thread is using. Like if you're carrying rocks from one pile to another and you pick up a small rock in one hand, you can pick up another small rock in your free hand because you're not using your full strength yet.

True and true.
An Intel engineer once told me that if the system is already using the full memory bandwidth, Hyperthreading can't kick in. Once said out loud it sounds obvious, but it's not something you would realize quickly.

2) So each core can do a few things, and this is RAID right?

RAID is for disks.

3) Each core can fetch instructions out of memory freely? And they can make changes to the memory freely?

Yes. They ask the memory controller, and it will try to honour requests in order of arrival. If two requests arrive at exactly the same time, some predefined order is used (it's like two guys trying to grab the same tool at the same time, and the same guy always says to the other one "you go first").

Without any kind of synchronization, this "make changes to the memory freely" behaviour can be problematic and chaotic. For example, core A reads the value "0", adds 1, then stores the result. Core B does the same thing. If they both read the value at the same time, by the time they both finish they will each write the result "1", while the expected result is probably "2".
The fix is called atomic operations. If an operation is atomic, then by the time another core tries to access the value, the first core has already loaded it, modified it, and stored the result back.
Non-atomic operations don't have that guarantee. For example, writing to unaligned memory in x86 is not atomic. So if I write the float "1.0f" to an unaligned address from core A while core B tries to read it, core B may see a mixture of "1.0f" and whatever value was there before. The value core B reads could be complete garbage (1.0f, a NaN, an infinity, any random number). The only way to fix that is to ensure all reads and writes are aligned.
But more complex operations (like reading multiple numbers) aren't atomic either, regardless of alignment, so if you don't use the synchronization primitives provided by the OS or the HW, it's going to fail sooner or later.

Non-atomic operations are like one guy peeling a carrot while another guy grabs it and boils it, without even realizing his coworker hadn't finished peeling it.
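Here is a deliberately broken sketch of that lost-update scenario in C++ (the names are made up, and the unsynchronized increment is a data race, i.e. undefined behaviour, shown only to illustrate the problem): both threads can read the same old value, so increments get lost. Making the counter a std::atomic<int>, or protecting it with a lock, fixes it.

#include <cstdio>
#include <thread>

int counter = 0; // deliberately not atomic

void add_many() {
    for (int i = 0; i < 1000000; ++i)
        ++counter; // read, add 1, write back - not one indivisible step
}

int main() {
    std::thread a(add_many), b(add_many);
    a.join();
    b.join();
    std::printf("expected 2000000, got %d\n", counter); // usually prints less
}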

So do they all just start running at the start of the program?

They start whenever the program requests another thread from the OS.

And if you do two unrelated things they can multithread them?

Best multithreading scenario: since there is no shared data, there is no need to synchronize at all. It's like one guy doing laundry while the other one cooks. Or two guys cooking dinner in different kitchens.

I thought that was up to the compiler to handle planning for that kind of thing, but I wouldn't know how it would if it could.

Threading is so complex that compilers can't handle it on their own. There is active research on compilers automatically threading trivial scenarios, but nothing concrete today.
What is popular are compiler threading extensions: with a few keywords (and by following lots of strict rules) you tell the compiler "this is threadable, run it in parallel"; the compiler checks that you're following most of the rules and tries to thread it. It doesn't work for very advanced stuff though.
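One widely used example of this kind of extension (not necessarily the one meant above) is OpenMP; the sketch below assumes you compile with OpenMP enabled (e.g. -fopenmp), otherwise the pragma is simply ignored. The single pragma asks the compiler to split the loop across threads, which is only safe here because the iterations are independent.

#include <vector>

void scale(std::vector<float>& v, float factor) {
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(v.size()); ++i)
        v[i] *= factor; // each iteration touches different data, so it can run in parallel
}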

4) Ignoring virtual cores and single core multithreading, how does one describe how a multi core CPU executes code. Is message passing built in, or do they all just start running at the same time? I guess I don't really understand that level of computer architecture, but I'd like to get a feel for it.

They just run at the same time. But there is some wiring between each core:

  • Instructions with the lock prefix. They're guaranteed to be atomic. The CPU sends a signal to a common place, and this place marks the memory region being operated on as "locked" until the CPU sends the "finished" signal. If a signal comes from another core for the same memory region, this place tells that core to stall, and later tells it to resume. This happens within a couple of cycles and is very fast, but it only works for simple operations like addition, subtraction, multiplication, etc.
  • Ring 0 synchronization instructions. Only available in ring 0 (kernel mode), these instructions tell a core to pause and resume operations (i.e. core A tells core B "wait for me to finish until I say so!"). This is how the OS can implement mutexes and other synchronization mechanisms (a minimal user-side sketch follows after this list). You have to ask the OS, because normal applications don't have ring 0 access. If they had it, viruses could hang your PC very easily (just tell all cores to pause indefinitely).
  • Internal stuff. For example, the Core i7 has synchronization mechanisms around the L3 cache to reduce the performance hit from something called "false cache sharing". This is REALLY advanced stuff, so I won't go into detail about it.
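From the application side you never touch those privileged instructions yourself; you just use the OS-backed primitive. A minimal sketch with a standard C++ mutex (the names are made up):

#include <mutex>

std::mutex m;
long balance = 0;

void deposit(long amount) {
    std::lock_guard<std::mutex> lock(m); // may put this thread to sleep via the OS
    balance += amount;                   // only one core runs this section at a time
}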
Cores execute the instructions in their cache right? They can fetch instructions from memory to their cache? So how does the computer prevent two cores from executing the same code at the same time? Does each core start at a different instruction address or something??

Instructions are first loaded from RAM and fetched into the L1 cache. The CPU executes the code from the L1 cache.
First of all, it may be the case that the computer actually wants to execute the same code at the same time, so it just fetches the same RAM into each core's L1 cache. This is perfectly legal. They may want to work on different data though. It's like running two instances of Media Player (same code) but viewing different movies (different data). When you ask the OS to start a thread, you're allowed to pass an integer with it. You can store whatever data you want in this integer, including a pointer to memory (i.e. "I want you to work on THIS data I'm sending you").

As for your question "how does the computer prevent two cores from executing the same code at the same time": if you really want to prevent this, it has been answered above, using the OS functions (like mutexes) that map to ring 0 instructions.
But 99% of the time, what you really want is to prevent two cores from working on the same DATA. Code != Data. Those two Media Player instances may be playing different movies, but they will share the same settings file. Only one can have read & write access to it, while the other one can only have read access (or no access at all, and resort to factory defaults) until the first instance releases the write lock on the file.

Multithreading isn't for beginners. By nature it tends to be chaotic, and many "fail" scenarios appear that transistors have to be dedicated to fixing. For example, if core A modifies memory address X, and this address was already loaded into all 4 cores' own L1 or L2 caches, those caches must be updated with the new value so they don't work with outdated data.

Thus, the memory controller must keep track of which memory regions are mapped into the caches and see whether the address X about to be modified was loaded by other caches. But caches can't just update one single value; they work by reading "lines" (chunks of adjacent memory; in L1, each line is typically 64 bytes), so they have to load the whole line again. If this happens too often, performance can be severely affected.
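The classic self-inflicted version of this is the "false cache sharing" mentioned earlier. A hedged sketch in C++ (the struct names are made up, and the 64-byte line size is an assumption that matches typical x86 parts): in the first layout two counters updated by different cores sit on the same cache line and keep invalidating each other; in the second layout each counter gets its own line.

#include <cstdint>

struct CountersBad {
    std::int64_t a; // updated by core 0
    std::int64_t b; // updated by core 1 - same cache line as 'a'
};

struct CountersGood {
    alignas(64) std::int64_t a; // each counter on its own 64-byte line
    alignas(64) std::int64_t b;
};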

Examples like this multiply, hence the complexity in HW and SW around multithreading.

GPUs, for example, are just raw power and don't care about any of this cache "out of date" business. If you modified X and didn't flush the caches, it's your fault and the caches will be out of date. Maybe you don't need the value to be up to date, and working with older values is perfectly fine or at least tolerable. And of course, performance will be higher because the HW isn't doing anything behind your back to ensure you don't mess up.

Ah, instead of trying to respond to each question individually I'll try to walk through threading on a single core. Then it's easy to see how additional cores can easily be added.

First let's talk about the general layout of a single-core CPU.

You have the ALU, which is the 'brain' that does the actual work, such as adding and subtracting. When an electrical pulse hits the CPU, the current instruction loaded into the instruction stream is used as a set of on/off gates that control which parts of the ALU get triggered. On bits pass current and off bits don't. That current passes through the ALU and causes some change in the internal state of the CPU based on which instruction is being processed. Attached to the ALU are the CPU's registers.

The CPU will have several registers, depending on what type it is. If a CPU is said to be 32-bit that means that each GPR (general purpose register) is 32-bits in length. A 64-bit CPU has 64-bit registers, etc. On an x86 you have 4 registers to play with data: eax, ebx, ecx, and edx. They have different uses depending on the situation (which instruction you want to execute), but you can more or less load or store data to them as you please. In addition to these there are some other registers for special uses:

(The following four are also considered part of the GPR set and can be accessed directly with the 'mov' instruction.)

ESP - extended stack pointer - points to the top of the stack

EBP - extended base pointer - points to the base of the current stack frame (the point at which the current function's stack space begins... sort of)

ESI/EDI - extended source/destination index - used for making memory transfers or other looping memory tasks more efficient. Some instructions use these to loop through memory in either a forward or backward direction. Can be very useful in such situations.

Special, non-GPR registers:

EFLAGS - This register is used as a set of bits that indicate the state of the CPU. Instructions can set the different bits or react to the state of them, but direct access to EFLAGS is not possible (you can get at it, but it takes some foolery).

EIP - extended instruction pointer - points to the memory location of the next instruction to be executed.

Segment registers:

These registers are not as commonly discussed, but have to do with the virtual memory management model. Essentially there's a few registers that can be used to point to a specific location in memory, called a 'segment' or 'page'. When the CPU is in virtual memory mode and gets an address, that address refers to a memory offset within one of those pages. Generally one page will point to the memory image of the module (the exe or dll being executed), one will point to the stack, and one will point to the heap. Others can also be available depending on the program. Windows calls these 'pages' and has a set of functions for managing them. Pages can have read/write/execute permissions set and will throw an exception if those permissions are violated, such as good old 0xC0000005 aka the segfault (segment fault - get it?).

There are also SIMD registers on most modern CPUs. These tend to be larger than the GPR and are often also used as the floating-point registers. They implement things like SSE or MMX and the like.

A simple ASM program to add 1 to a variable looks like this:

mov eax,variable  ; load the value of 'variable' into the eax register

inc eax           ; increment (increase by 1) the eax register

mov variable,eax  ; store the value in the eax register back to the memory belonging to 'variable'

This is where the cache would come in. When the CPU wants to read a value from memory it sends a request to the system bus. The bus interprets the request, finds the memory and sends it back to the CPU. This takes several CPU cycles, so the CPU ends up spending most of its time waiting for the value to arrive before it can continue. In order to solve this problem modern CPUs have an onboard cache. The CPU tries to predict what memory you're going to want and asks the bus for it ahead of time. Then when you go to load the value it's (hopefully) already on hand and you don't have to wait for the bus.

Changes to the value are stored in the cache until the CPU determines that you're done with that data, then it flushes it back to main memory. On systems where more than one device has access to main memory this can cause problems, since the CPU can work on some data and then still have it cached - the other device may look in main memory and not see the changes that haven't been flushed from the cache yet.

The 'volatile' keyword in C/C++ prevents the cache from holding the specified data. It always reads that data from main memory and always writes changes to main memory. The instruction stream typically doesn't need to know anything about what the cache is doing except in those cases. The cache more or less takes care of itself. (It actually reads the results of the CPU's operations in order to try and decide which instructions/data should be pre-fetched.)

A brief look at the stack comes next. To enter a function you typically use the 'call' instruction and give it an address:


#include <cstdio>

void myFunc(char* message) {
  printf(message);
}

int main() {
  char myStr[] = "Hello.";
  __asm {
    lea eax, myStr   // load the address of the local buffer
    push eax         // push the char* argument
    call myFunc
    add esp, 4       // caller cleans up the pushed argument (cdecl)
  };
}

When a process starts it allocates a section of memory for the stack. The base pointer and stack pointer are set to the end of that memory. When a value is 'pushed', the stack pointer is first reduced to make room and then the value is written to the location it points to. So if I say 'push eax', ESP is reduced by 4 (meaning 4 bytes) and then the value in eax is written to the address ESP now points to. When I say 'pop eax', the value ESP points to is placed in eax and then ESP is increased by 4. Memory at an address lower than ESP is considered 'invalid'. It can contain anything at all and should not be referred to unless you're doing something weird.

To call a function, first the arguments are pushed in reverse order, then the 'call' instruction is used. 'call' does the following:

push EIP to the stack (store the address of the next instruction - the one after the 'call')

jump to the label/address provided as an argument to 'call' (in the example it was the label 'myFunc')

The function called will usually set up its 'stack frame' like so:

push ebp (push the current base pointer to the stack)

mov ebp,esp (set the base pointer to point to the current stack pointer position)

sub esp,??? (where ??? is the total size of the function's local variables)

So what just happened there? Well, ebp gets pushed. This gives us a value to set it back to when we exit the function.

Then we set ebp's new value to the current stack position. This means that ebp now points to the previous frame's ebp. If we need to unwind the stack we've got a chain of pointers that all point right down the chain to our original ebp.

After that we subtract the local storage size from esp, meaning that we've reserved space for our local variables on the stack. The compiler knows which address in that space belongs to which variable.

At this point the following is true:

  • The function arguments begin at ebp+8 and increase upward (since we pushed them in reverse order prior to 'call')
  • Local variable storage begins at ebp-4 and increases downward
  • Local storage is not necessarily in the order of its C declaration. The compiler is free to order things how it wants within the stack frame. (A small concrete example follows this list.)
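To make that concrete, here is a hedged sketch (a hypothetical 32-bit cdecl function with a traditional ebp frame and no optimizations; real compilers may reorder locals or omit the frame pointer entirely):

int sum3(int a, int b, int c) { // a at [ebp+8], b at [ebp+12], c at [ebp+16]
    int total = a + b + c;      // a local somewhere below ebp, e.g. [ebp-4]
    return total;               // the return value travels back in eax
}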

Once the function is done it wants to return control to the caller. It does the following:

If there's a return value then place it in the eax register

Move the value in ebp to esp - this puts the stack pointer back at the point it was at when we entered the function, effectively freeing the stack frame's memory

Pop the stack into ebp - restore ebp to the value it had prior to 'call'. now it points to the base of the caller's stack frame again

'ret' - this instruction basically pops into EIP, moving the instruction pointer to the position of the instruction after our original 'call'

At this point the arguments that we pushed prior to 'call' would still be on the stack. There's two ways to handle this:

Provide a numeric argument when calling 'ret' - This value will be added to ESP after EIP is popped, effectively freeing the pushed args.

Just manually add the length of the args to ESP after the function returns. - This has to be done in cases where the length of the args is unknown to the function, such as in the case of printf, which can have a variable number of arguments.

Okay, so that's very complicated unless you just sit there and watch it in action for a while in the debugger.

Now we're almost to threading. The next concept is interrupts. A program (such as the thread scheduler) can set up a series of instructions in memory and then tell the CPU to register the start address there as an interrupt routine. The routine is matched to an interrupt number and remembered until the CPU is told otherwise. When the CPU gets an interrupt signal it will have a number accompanying it. The CPU records all registers, including EFLAGS, EIP and the segment pointers, then transfers control to the routine matching the interrupt number it just got. When the interrupt routine is done it restores all the registers and the program resumes.

So you may now have a notion of how a thread scheduler will work. A routine is set up as an interrupt routine, but instead of returning to the interrupted thread it saves the register state and then returns to the state saved from the thread that's scheduled next. Interrupts can also be set to trigger on timers, which makes threading easier to implement.

In the case of a multi-core system the scheduler just has the option of choosing which core to interrupt and give the next thread to. Apart from memory access concerns the cores don't really need to know anything about one another. They just keep executing the instruction at EIP until they get an interrupt.

All that's somewhat simplified, though. There's a lot of tiny details and craziness that goes on in there that can keep you busy for many sleepless nights.

Still, it's fun, right?

void hurrrrrrrr() {__asm sub [ebp+4],5;}

There are ten kinds of people in this world: those who understand binary and those who don't.
Wow, that was quite an explanation that one Khatharr wrote.

Just one detail...
The 'volatile' keyword in C/C++ prevents the cache from holding the specified data. It always reads that data from main memory and always writes changes to main memory.
Volatile in C/C++ won't prevent or hint the cache. It just prevents the compiler from optimizing whatever is volatile; this is useful if, for example, you have memory-mapped hardware where setting the same address three times to the same value is actually meaningful, so the compiler doesn't just eliminate the 2 extra sets (i.e. it will compile into three mov instructions).
The volatile keyword forces a load or store from memory on each operation, but that doesn't mean the cache won't try to... cache it.
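A hedged sketch of that point (the function and parameter are made up; on real hardware the pointer would come from a memory-mapped device): 'volatile' makes the compiler keep all three stores instead of folding them into one, but it says nothing about what the cache or the other cores do with them.

#include <cstdint>

void poke_device(volatile std::uint32_t* reg) {
    *reg = 1; // all three writes are emitted by the compiler
    *reg = 1;
    *reg = 1;
}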

This topic is closed to new replies.
