• Create Account

# How does Multithreading Assembly work?

Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

17 replies to this topic

### #1Blind Radish  Members   -  Reputation: 355

Like
0Likes
Like

Posted 29 December 2012 - 04:15 PM

I've been mulling this for a long time, but I can't picture the computer executing assembly instructions in parallel.

Does each core execute it's own stream of instructions?  Or what?

I'm not talking about implementation details, I'm curious about how multicore processing is implemented in the circuitry and such.

Go easy on the jargon though, I'm no where near up to date on this stuff.  I'd get the theory if you stuck with metaphors or the more common wording though.

### #2rdragon1  Crossbones+   -  Reputation: 1099

Like
0Likes
Like

Posted 29 December 2012 - 04:18 PM

Yes, each core is a full CPU - each has its own set of registers (including the program counter). As a result, each core is running a separate thread in the system. It doesn't make sense for two cores to be running the same thread. A 4 core CPU is almost the same as having 4 separate CPU sockets each with a 1 core processor, except a 4-core processor on the same die will share various forms of cache and such.

### #3Matias Goldberg  Crossbones+   -  Reputation: 2565

Like
2Likes
Like

Posted 29 December 2012 - 04:58 PM

Yep, a 4-core CPU is no much different from 4 single core PCs running at the same time; except for very basic synchronization mechanisms i.e. the 4 cores share the same RAM and Bus.

For example typical Intel Core 2 Quad, each core has it's own L1 cache, L2 cache is shared by group of 2 cores.

In the typical Intel Core i7, I don't remember how the L2 is shared (or if it was...), but there is an additional L3 cache shared by all cores.

The first one is the one described.

The second one has nothing to do with cores or multithreading and is present in single core CPUs since the original Pentium. Each CPU has more than one pipeline to execute multiple instructions at the same time. If you have for example "add eax, ecx" followed by "add edi, ecx" (i.e. a = a + b & c = c + b) both instructions are independent and each pipeline will execute both at the same time. It may also be called "instruction pairing". Not all instruction permutations can be paired (because some CPU architectures don't clone the whole pipeline, just a part of it. Usually for cheaper manufacturing or due to high power consumption)

Note that the CPU won't pair two dependent instructions. For example "add eax, ecx" followed by "add edi, eax" can't be executed in parallel, because the second instruction depends on the result from the former one. Compiler optimizations rearrange instructions to interleave non-related operations so that instruction pairing chance can be maximized.

Also take in mind x86/x64 CPUs implement something called "Out of Order Execution" which is basically a component in the CPU looking ahead in which instructions are going to be executed, and if it finds one that is independent (i.e. one that doesn't depend on the result of eax), and executes that instruction at the same time as "add eax, ecx" then caches the result in temporary HW space for the time those instructions that come later get actually executed. OoOE may also kick in when one of the pipelines is stalled waiting for data in RAM to arrive. In other words, it does the same as the compiler does (reorder operations to maximize instruction pairing and minimize stalls) but at hardware level, and you have little to no control over it (except for instructions that issue a memory barrier)

When dealing with lock-less multithreading (as in multi-core) code, OoOE can be a PITA if one doesn't make correct use of memory barriers (to prevent the OoOE unit from looking farther than the barrier, until the barrier is reached) because it may lead to race conditions.

The third one, hyperthreading, was invented by Intel, and a fake of the first one (multicore). When the CPU is only using one of the pipelines described as instruction-level parallelism and the OoOE couldn't do anything to prevent it, we've got a lot of idle pipelines.

Because those pipelines are "almost" an extra core, Hyperthreading kicks in and simulates an extra core to execute a parallel thread, but is not near in the same level as a real additional core because: 1. Not all parts of the pipeline are duplicated and 2. The main thread may actually start using those unused pipelines, leaving no spare components to dedicate to the hyperthreaded thread.

Intel claims Hyperthreading yields an average gain of 15% in performance.

Btw, if you open Task Manager in a single core system with hyperthreading, it will tell you there are two cores (the real one + fake one). In Intel Core i7 with 8 cores, it will tell you there are 16 cores (8 real ones + 8 fake ones)

Not all CPUs come with HT, in fact, most of them don't.

Edited by Matias Goldberg, 29 December 2012 - 05:01 PM.

### #4Blind Radish  Members   -  Reputation: 355

Like
0Likes
Like

Posted 29 December 2012 - 08:52 PM

1) You would think Hyperthreading slows down the execution, because it's having to load and unload data for each operation.  Maybe not in certain test cases though.

2) So each core can do a few things, and this is RAID right?

3) Each core can fetch instructions out of memory freely?  And they can make changes to the memory freely?  So they all just run all at the start of the program?  And if you do two unrelated things they can multithread them?  I thought that was up to the compiler to handle planning for that kind of thing, but I wouldn't know how it would if it could.

4) Ignoring virtual cores and single core multithreading, how does one describe how a multi core CPU executes code.  Is message passing built in, or do they all just start running at the same time?  I guess I don't really understand that level of computer architecture, but I'd like to get a feel for it.

Cores execute the instructions in their cache right?  They can fetch instructions from memory to their cache?  So how does the computer prevent two cores from executing the same code at the same time?  Does each core start at a different instruction address or something??

I'm sorry I just don't quite understand who tells which core where to start and such.

### #5Khatharr  Crossbones+   -  Reputation: 2589

Like
0Likes
Like

Posted 29 December 2012 - 10:52 PM

1) The workload is masked by the fact that the main thread is not using the resources the hyper-thread is using. Like if you're carrying rocks from one pile to another and you pick up a small rock in one hand, you can pick up another small rock in your free hand because you're not using your full strength yet.

2) RAID stands for "Redundant Array of Individual Disks", so probably not.

3) Yes, all cores have access to the system bus and cache. I don't know what you mean by 'they run'. Cores don't run. They're physical objects on the CPU die. The OS handles core distribution, since it controls the thread scheduler.

4) The thread scheduler locks the core, sets its registers, including the flags and instruction pointer and then unlocks it. The core will then execute instructions normally, ignoring the other cores unless its given a special signal. When it's time to schedule a new thread the scheduler just locks the core again, captures the registers and then sets them to the values of the next thread and unlocks.

As for the cache, the core is only vaguely aware of it. At the level of assembly you usually don't mess with the cache unless you have to. (Actually I don't recall reading about any instructions for messing with the instruction cache, but I haven't got too far into messing with the cache yet.) It's a separate unit that just streamlines the process of instruction fetching so that you don't have to wait on the bus as often.

Typically each core would start at a different address, yes, but there's no reason that more than one core can't be running the same instruction at the same time. The instruction stream is typically on a memory page with read-only access. The data that gets modified is either on the heap page (dynamic memory) or on the stack page (local variables and the function stack).

If you get comfortable with assembly and threading then you'll have an easier time with it.

Edited by Khatharr, 29 December 2012 - 11:07 PM.

void hurrrrrrrr() {__asm sub [ebp+4],5;}

There are ten kinds of people in this world: those who understand binary and those who don't.

### #6Blind Radish  Members   -  Reputation: 355

Like
0Likes
Like

Posted 30 December 2012 - 02:33 AM

1) Thanks, but unfortunately I still don't get it.  The main thread isn't working??  Is this to do with fetching memory delays?  Thanks for being patient with me.

2) I'm mixing up my Acronyms again.

3) By run I mean execute instructions on data.  Does the OS tell the other cores what to execute?  Does one core execute the OS?  That doesn't seem right at all.  The OS can run on multiple cores, so how does each core know which part of the OS to run?  Does another core tell them?  So does it all start with one core??

I'm trying to do exactly that, get comfortable with them.  I'm just not getting how assembly can multithread.  How does one instruction ensure it's executed only on one core?  I don't mean two cores using the same instructions, I mean the same instruction on the same data at the same time.  I don't mean memory management, I mean how does one core know it's supposed to do something different than another core?

4) I actually thought that the "registers" were a special portion of the "cache", but that doesn't really make sense anymore lol.  I was thinking it's rather important to be able to tell a core to purge it's cache for locking/unlocking functionality.  You know, to ensure a fast access copy is up to date with the rest of the memory.

### #7shadowomf  Members   -  Reputation: 319

Like
0Likes
Like

Posted 30 December 2012 - 04:35 AM

Warning, a bit off-topic:

each core is a full CPU

No, not necessarily, often not. It depends on the vendor and what he defines as a core.

It's not uncommon to share some parts between cores. It's cheaper (less silicon = more money), can save energy and sometimes it doesn't even mean that the performance is reduced.

One example is the UltraSPARC T1, it had 8 cores, but only a single floating-point unit. They where aiming at webservers and databases and didn't felt they need much floating-point computation. Of course for their target market it was correct, on the other side it meant the CPU was useless fpr HPC.

It's an old example and one of the more extreme ones, but even nowadays you can't expect 4/6/8/12/16/whatever cores to be the same as having the same number of complete processors.

On the other hand, having them (the cores) all on one die could mean that it's easier for them processor manufacturers to implement some kind of work stealing or similar things which could offset the lack of a few processing units.

Just take a look at wikipedia for more details on it. For example the bulldozer microarchitecture has something that is between Hyperthreading and having two actual cores. And there are many more of such interesting examples.

Back on topic:

1) Thanks, but unfortunately I still don't get it. The main thread isn't working?? Is this to do with fetching memory delays? Thanks for being patient with me.

Maybe this will help: (https://en.wikipedia.org/wiki/Instruction_pipeline)

Imagine one thread can't fille the pipeline and the processing units (ALU, FPU) don't have enough work to do. Hyperthreading just makes the processing units do the work of another thread while they are waiting for new instructions.

How does one instruction ensure it's executed only on one core? I don't mean two cores using the same instructions, I mean the same instruction on the same data at the same time.

You mean how does code get threadsafe? On x86/x64 by using the lock prefix, making the instruction atomic. Don't know how it's implemented in the hardware.

You also have to distinguish between read and writes, for example it's no problem for multiple cores to read the same data. However if the data is also written to it get's complicated.

I mean how does one core know it's supposed to do something different than another core?

If all is the same (same data, same instruction, same flags) it's exspected to do the same. Or not?

### #8Matias Goldberg  Crossbones+   -  Reputation: 2565

Like
2Likes
Like

Posted 30 December 2012 - 12:58 PM

1) You would think Hyperthreading slows down the execution, because it's having to load and unload data for each operation. Maybe not in certain test cases though.

1) The workload is masked by the fact that the main thread is not using the resources the hyper-thread is using. Like if you're carrying rocks from one pile to another and you pick up a small rock in one hand, you can pick up another small rock in your free hand because you're not using your full strength yet.

True and true.
Once an Intel engineer told me if the system is fully using the memory bandwidth, Hyperthreading can't kick in. Once said out loud, looks obvious, but it's not something you would realize quickly.

2) So each core can do a few things, and this is RAID right?

RAID is for disks.

3) Each core can fetch instructions out of memory freely?  And they can make changes to the memory freely?

Yes. They ask the memory controller, and it will try to honour by order of arrival. If they arrive at the exact same time then some predefined order is taken (i.e. it's like two guys trying to grab the same tool at the same time; and the same guy always says to the other one "you go first")

Without any kind of synchronization, this "make changes to the memory freely" behaviour can be problematic and chaotic. For example core A reads the value "0" and adds 1, then stores it. Core B does the same thing. If they both read the value at the same time, by the time they both finish they will write the result "1", while the probably expected behaviour is that the result should be "2".
It's called atomic operations. If an operation is atomic, then by the time the other core tries to access it, the other core already loaded it, modified it, and stored the result back.
Non-atomic operations don't have that guarantee. For example writing to unaligned memory in x86 is not atomic. So if I write the float "1.0f" to an unaligned address in memory from core A, while trying to read it from core B it may be a mixture of "1.0f" and whatever value it used to have. The value that core B reads would be complete garbage (could be 1.0f, a NaN, an infinite, any random number). The only way to fix that is ensuring there are only aligned reads.
But more complex operations (like reading multiple numbers) aren't atomic either (regardless of alignment) so if you don't use synchronization primitives provided by the OS or the HW, it's going to fail sooner or later.

Non-atomic operations can be seen like peeling a carrot while another guy grabs it and boil it, without even realizing his coworker wasn't finishing peeling it.

So they all just run all at the start of the program?

They start whenever the program requests another thread to the OS

And if you do two unrelated things they can multithread them?

Best multithreading scenario since there is no shared data, no need to synchronize at all. It's like one guy doing laundrey while the other one cooks. Or two guys cooking dinner in different kitchens.

I thought that was up to the compiler to handle planning for that kind of thing, but I wouldn't know how it would if it could.

Threading is so complex that compilers can't handle it. There is active research now on compiler automatically trying to thread trivial scenarios, but nothing concrete today.
What is popular are compiler threading extensions that with a few keywords (and by following lots of strict rules), you're telling the compiler "this is threadable, run it in parallel" the compiler checks you're following most of the rules, and tries to thread it. It doesn't work for very advanced stuff though.

4) Ignoring virtual cores and single core multithreading, how does one describe how a multi core CPU executes code.  Is message passing built in, or do they all just start running at the same time?  I guess I don't really understand that level of computer architecture, but I'd like to get a feel for it.

They just run at the same time. But there is some wiring between each core:

• Instructions with lock prefix. They're guaranteed to be atomic. The CPU sends a signal to a common place, and this place marks the memory region to be operated on as "locked" until the CPU sends the "finished" signal. If another signal comes from another thread for the same memory region, this place tells the CPU to stall, and later tells to resume. This happens whithin a couple cycles and is very fast, but only works for simple operations like, addition, substraction, multiplication, etc.
• Kernel 0 synchronization instructions. Only available in Kernel 0, those instructions tell the CPU to pause and resume operations (i.e. Core A tells core B "wait for me to finish until I say so!"). This is how the OS can implement mutexes and other synchronization mechanism. You have to ask the OS, because normal applications don't have Kernel 0 access. If they had, viruses could hang your PC very easily (just tell all cores to pause indefinitely)
• Internal stuff. For example the Core i7 has sync. mechanism in the L3 to reduce performance hit by something called "false cache sharing". This is REALLY advanced stuff, so I won't go into detail about it.
Cores execute the instructions in their cache right?  They can fetch instructions from memory to their cache?  So how does the computer prevent two cores from executing the same code at the same time?  Does each core start at a different instruction address or something??

Instructions are first loaded from Memory RAM and fetched into L1 cache. The CPU executes the code in the L1 cache.
First of all, it may be the case that the computer actually wants to execute the same code at the same time, so they just fetch the same memory RAM into each core's L1 cache. This is perfectly legal. They may want to work on different data though. It's like running two instances of Media Player (same code) but viewing different movies (different data). When you ask the OS to start a thread, you're allowed to pass an integer with it. You can store in this integer whatever data you want, including a pointer to memory "i.e. I want you to work on THIS data I'm sending you"

As for your question "how does the computer prevent two cores from executing the same code at the same time", If you really want to prevent this, it has been answered above: Using the OS functions (like mutexes) that map to Kernel 0 instructions.
But 99% of the time, what you really want is to prevent two cores from working on the same DATA. Code != Data. i.e. Those two media players instances may be playing different movies, but they will share the same settings file. Only one can have read & write access to it, while the other one can only have read access (or no access at all, and resort to factory defaults) until the other instance releases the write lock on the file.

Multithreading isn't for beginners. By nature they tend to be chaotic, and many "fail" scenarios appear that transistors have to be dedicated to fix those. For example if Core A modifies X memory address, and this address was already loaded into all 4 cores' own L1 or L2 caches, the cache must be updated with the new value so they don't work with outdated data.

Thus, the memory controller must keep track of what memory regions are mapped in the caches and see if the address X about to be modified was loaded by other caches. But caches can't just update one single value, they work by reading "lines" (chunks of adjacent memory, in L1, typically each line is 64 bytes) so they have to load the whole line again. If this happens too often, performance can be severely affected.

Examples like this multiplicate, hence the complexity in HW and SW about multithreading.

GPUs for example are just raw power and don't care about anything about this cache "out of date" thing. If you modified X and didn't flush the caches, it's your fault and caches will be out of date. May be you don't need the value to be up to date, and you working with older values is perfectly fine or just tolerable/acceptable. And of course, performance will be bigger because the HW isn't doing anything behind your backs to ensure you don't mess up.

Edited by Matias Goldberg, 30 December 2012 - 01:07 PM.

### #9Khatharr  Crossbones+   -  Reputation: 2589

Like
3Likes
Like

Posted 30 December 2012 - 03:20 PM

Ah, instead of trying to respond to each question individually I'll try to walk through threading on a single core. Then it's easy to see how additional cores can easily be added.

First let's talk about the general layout of a single-core CPU.

You have the ALU, which is the 'brain' that does the actual work, such as adding and subtracting or etc. When an electrical pulse hits the CPU the current instruction that's loaded into the instruction stream is used as a set of on/off gates that control which parts of the ALU get triggered. On bits pass on current and off bits don't. That current passes through the ALU and causes some change in the internal state of the CPU based on what instruction is being processed. Attached to the ALU are the CPU's registers.

The CPU will have several registers, depending on what type it is. If a CPU is said to be 32-bit that means that each GPR (general purpose register) is 32-bits in length. A 64-bit CPU has 64-bit registers, etc. On an x86 you have 4 registers to play with data: eax, ebx, ecx, and edx. They have different uses depending on the situation (which instruction you want to execute), but you can more or less load or store data to them as you please. In addition to these there are some other registers for special uses:

(These four are also considered part of the GPR and can be accessed directly by 'mov' instruction.)

ESP - extended stack pointer - points to the top of the stack

EBP - extended base register - points to the base of the stack (the point at which the current function's stack space begins... sort of)

ESI/EDI - extended source/destination index - used for making memory transfers or other looping memory tasks more efficient. Some instructions use these to loop through memory in either a forward or backward direction. Can be very useful in such situations.

Special, non-GPR registers:

EFLAGS - This register is used as a set of bits that indicate the state of the CPU. Instructions can set the different bits or react to the state of them, but direct access to EFLAGS is not possible (you can get at it, but it takes some foolery).

EIP - extended instruction pointer - points to the memory location of the next instruction to be executed.

Segment registers:

These registers are not as commonly discussed, but have to do with the virtual memory management model. Essentially there's a few registers that can be used to point to a specific location in memory, called a 'segment' or 'page'. When the CPU is in virtual memory mode and gets an address, that address refers to a memory offset within one of those pages. Generally one page will point to the memory image of the module (the exe or dll being executed), one will point to the stack, and one will point to the heap. Others can also be available depending on the program. Windows calls these 'pages' and has a set of functions for managing them. Pages can have read/write/execute permissions set and will throw an exception if those permissions are violated, such as good old 0xC0000005 aka the segfault (segment fault - get it?).

There are also SIMD registers on most modern CPUs. These tend to be larger than the GPR and are often also used as the floating-point registers. They implement things like SSE or MMX and the like.

A simple ASM program to add 1 to a variable looks like this:

mov eax,variable #load the value of 'variable' to the eax register

inc eax #increment (increase by 1) the eax register

mov variable,eax #store the value in the eax register in the memory belonging to 'variable'

This is where the cache would come in. When the CPU wants to read a value to memory it sends a request to the system bus. The bus interprets the request, finds the memory and sends it back to the CPU. This takes several CPU cycles to do, so the CPU ends up spending most of its time waiting for the value to arrive before it can continue. In order to solve this problem modern CPUs have an onboard cache. The CPU tries to predict what memory you're going to want and ask the bus for it ahead of time. Then when you go to load the value it's (hopefully) already on hand and you don't have to wait for the bus. Changes to the value are stored in the cache until the CPU determines that you're done with that data, then it flushes it back to main memory. On systems where more than one device have access to main memory this can cause problems, since the CPU can work on some data and then still have it cached - the other device may look in main memory and not see the changes that haven't been flushed from the cache yet. The 'volatile' keyword in C/C++ prevents the cache from holding the specified data. It always reads that data from main memory and always writes changes to main memory. The instruction stream typically doesn't need to know anything about what the cache is doing except in those cases. The cache more or less takes care of itself. (It actually reads the results of the CPU's operations in order to try and decide which instructions/data should be pre-fetched.)

A brief look at the stack comes next. To enter a function you typically use the 'call' instruction and give it an address:

void myFunc(char* message) {
printf(message);
}

int main() {
char myStr[] = "Hello.";
__asm {
push myStr;
call myFunc;
};
}

When a process starts it allocates a section of memory for the stack. The base pointer and stack pointer are set to the end of that memory. When a value is 'pushed' it gets written to the location pointed to by the stack pointer, then the stack pointer is reduced to make room for the next value. So if I say 'push eax', the value in eax gets written to the address pointed to by ESP, then ESP is reduced by 4 (meaning 4 bytes). When I say 'pop eax' the stack pointer is increased by 4 and then the value it points to is placed in eax. Memory which is at an address lower than ESP is considered 'invalid'. It can contain anything at all and should not be referred to unless you're doing something weird.

To call a function, first the arguments are pushed in reverse order, then the 'call' instruction is used. 'call' does the following:

push EIP to the stack (store the address of the next instruction - the one after the 'call')

jump to the label/address provided as an argument to 'call' (in the example it was the label 'myFunc')

The function called will usually set up its 'stack frame' like so:

push ebp (push the current base pointer to the stack)

mov ebp,esp (set the base pointer to point to the current stack pointer position)

sub esp,??? (where ??? is the total size of the function's local variables)

So what just happened there?  Well, ebp gets pushed. This gives us a value to set it back to when we exit the function.

Then we set ebp's new value to the current stack position. This means that ebp now points to the previous frame's ebp. If we need to unwind the stack we've got a chain of pointers that all point right down the chain to our original ebp.

After that we subtract the local storage size from esp, meaning that we've reserved space for our local variables on the stack. The compiler knows which address in that space belongs to which variable.

At this point the following is true:

• The function arguments begin at ebp+8 and increase upward (since we pushed them in reverse order prior to 'call')
• Local variable storage begins at ebp-4 and increases downward
• Local storage is not necessarily in the order of its C declaration. The compiler is free to order things how it wants within the stack frame.

Once the function is done it wants to return control to the caller. It does the following:

If there's a return value then place it in the eax register

Move the value in ebp to esp - this puts the stack pointer back at the point it was at when we entered the function, effectively freeing the stack frame's memory

Pop the stack into ebp - restore ebp to the value it had prior to 'call'. now it points to the base of the caller's stack frame again

'ret' - this instruction basically pops into EIP, moving the instruction pointer to the position of the instruction after our original 'call'

At this point the arguments that we pushed prior to 'call' would still be on the stack. There's two ways to handle this:

Provide a numeric argument when calling 'ret' - This value will be added to ESP after EIP is popped, effectively freeing the pushed args.

Just manually add the length of the args to ESP after the function returns. - This has to be done in cases where the length of the args is unknown to the function, such as in the case of printf, which can have a variable number of arguments.

Okay, so that's very complicated unless you just sit there and watch it in action for a while in the debugger.

Now we're almost to threading. The next concept is interrupts. A program (such as the thread scheduler) can set up a series of instructions in memory and then tell the CPU to register the start address there as an interrupt routine. The routine is matched to an interrupt number and remembered until the CPU is told otherwise. When the CPU gets an interrupt signal it will have a number accompanying it. The CPU records all registers, including EFLAGS, EIP and the segment pointers, then transfers control to the routine matching the interrupt number it just got. When the interrupt routine is done it restores all the registers and the program resumes.

So you may now have a notion of how a thread scheduler will work. A routine is set up as an interrupt routine, but instead of returning to the interrupted thread it saves the register state and then returns to the state saved from the thread that's scheduled next. Interrupts can also be set to trigger on timers, which makes threading easier to implement.

In the case of a multi-core system the scheduler just has the option of choosing which core to interrupt and give the next thread to. Apart from memory access concerns the cores don't really need to know anything about one another. They just keep executing the instruction at EIP until they get an interrupt.

All that's somewhat simplified, though. There's a lot of tiny details and craziness that goes on in there that can keep you busy for many sleepless nights.

Still, it's fun, right?

Edited by Khatharr, 30 December 2012 - 03:27 PM.

void hurrrrrrrr() {__asm sub [ebp+4],5;}

There are ten kinds of people in this world: those who understand binary and those who don't.

### #10Matias Goldberg  Crossbones+   -  Reputation: 2565

Like
0Likes
Like

Posted 30 December 2012 - 06:26 PM

Wow, that was quite an explanation that one Khatharr wrote.

Just one detail...
The 'volatile' keyword in C/C++ prevents the cache from holding the specified data. It always reads that data from main memory and always writes changes to main memory.
Volatile in C/C++ won't prevent or hint the cache. It just prevents the compiler from optimizing whatever is volatile; this is useful if for example you have memory mapped hardware and setting the same address three times to the same value is actually meaningful, the compiler doesn't just eliminate the 2 extra sets (i.e. it will compile into three mov instructions).
The volatile keyword will force to always load and store from memory on each operation, but that doesn't mean the cache won't try to... cache it.

### #11Khatharr  Crossbones+   -  Reputation: 2589

Like
0Likes
Like

Posted 30 December 2012 - 07:00 PM

Ah, good to know. I only had to worry about cache ops on my REDACTED when I was REDACTED it because the GPU used DMA, so if you swizzled a texture for faster GPU speeds (or otherwise modified it with the CPU) you had to call a writeback function on the specific memory range that you modified. Alternatively it had an uncached mirror of main memory starting at 0x80000000 IIRC.

Edited by Khatharr, 30 December 2012 - 07:10 PM.

void hurrrrrrrr() {__asm sub [ebp+4],5;}

There are ten kinds of people in this world: those who understand binary and those who don't.

### #12MJP  Moderators   -  Reputation: 8508

Like
1Likes
Like

Posted 30 December 2012 - 07:46 PM

The third one, hyperthreading, was invented by Intel, and a fake of the first one (multicore). When the CPU is only using one of the pipelines described as instruction-level parallelism and the OoOE couldn't do anything to prevent it, we've got a lot of idle pipelines.

Because those pipelines are "almost" an extra core, Hyperthreading kicks in and simulates an extra core to execute a parallel thread, but is not near in the same level as a real additional core because: 1. Not all parts of the pipeline are duplicated and 2. The main thread may actually start using those unused pipelines, leaving no spare components to dedicate to the hyperthreaded thread.
Intel claims Hyperthreading yields an average gain of 15% in performance.
Btw, if you open Task Manager in a single core system with hyperthreading, it will tell you there are two cores (the real one + fake one). In Intel Core i7 with 8 cores, it will tell you there are 16 cores (8 real ones + 8 fake ones)
Not all CPUs come with HT, in fact, most of them don't.
Just to add to this...the generic non-Intel term for this technique is "simultaneous multithreading", or "SMT" for short.

Edited by MJP, 30 December 2012 - 07:46 PM.

### #13Khatharr  Crossbones+   -  Reputation: 2589

Like
0Likes
Like

Posted 30 December 2012 - 10:12 PM

Correction: esp points to the last pushed value. When a value is pushed esp is decreased and then writes. When a value is popped the value is read and then esp is increased. I incorrectly stated that the increase/decrease would occur in such a way that would leave esp pointing to invalid memory. Also, interrupt only pushes some of the registers (including eflags). The interrupt routine is responsible for pushing the additional ones.

Edited by Khatharr, 30 December 2012 - 10:16 PM.

void hurrrrrrrr() {__asm sub [ebp+4],5;}

There are ten kinds of people in this world: those who understand binary and those who don't.

### #14Blind Radish  Members   -  Reputation: 355

Like
0Likes
Like

Posted 30 December 2012 - 10:37 PM

Thanks Everyone.  This stuff is pretty complex.

Alright Khatharr, you've signed yourself up for some questions.

Is cache management, like flushing a specific variable, is that manual?
Is interuption manual?
By manual I mean, could a compiler insert an instruction to do this?

Or is it all automatic?

And the important question, the main question I've been wondering about.

A compiler can't build code that works outside the OS?
Once the OS boots it takes control or what?

So an exe for example, does it start with some general notion of where the code starts at?
And I'm assuming it starts at a single point and can only branch out by controlling the scheduler?

Just explain to me how I would, in assembly, instruct the computer to carry out two different threads at once.
How I would say something basic like "do A, B, and C in seperate caches.  when you've done A and B, do D."

If I could just understand how the computer knows what to multithread and how it communicates that, I'd be good.

But again, thanks everyone for the details.  This will really help out my work!
It's crucial to my success to understand this, I just want you all to know.

### #15Khatharr  Crossbones+   -  Reputation: 2589

Like
1Likes
Like

Posted 30 December 2012 - 11:40 PM

Is cache management, like flushing a specific variable, is that manual?

Depending on the platform. You shouldn't have to mess with it in Windows, for instance. Platforms where it's useful will generally have API functions for managing it. It's only necessary when you have DMA going on and you have a situation where a DMA device may want to use memory that's cached on the CPU. It's actually quite rare, even in those situations. I think I only ever used it for the texture swizzle function and the texture hue-change function on aforementioned device. I just had to put a writeback call at the end of those functions and everything worked fine.

Interrupts can be signaled from asm. The instruction is INT on x86. http://pdos.csail.mit.edu/6.828/2004/readings/i386/s02_06.htm

(Edit - If you pick a fight with the thread scheduler the skeleton man will eat you.)

With the virtual memory model programs are more or less compiled as though they have exclusive access to the machine. The thread scheduler interacts with them from the outside. It's sort of like an emulator save-state. If you have the register states recorded and the memory recorded then you can restore both and resume execution as if there had been no interruption. VMM allows this because it can easily swap out pages for different processes, so each process basically has its own memory within the span of real-address-mode memory (or the swap file, depending on memory conditions). Since the stack for a thread is stored in that thread's stack page the register states can be pushed to that stack and then the stack page can be swapped to that of the next thread and the registers popped. This restores both the memory and the register states of the upcoming thread, so it continues as if nothing had ever happened.

When a PC boots it has a specific location that it looks to, usually on the primary disk volume, to find a startup module. The BIOS handles getting everything out of bed and then does basic loading to get the ball rolling. After that the startup module loads the OS and everything snowballs into place. On other platforms (like REDACTED that I was mentioning) the platform only runs one program at a time and the OS is just the first program to start. When it wants to run a program it places the path of the program in a specific place in memory then does a sort of soft-boot and the indicated program gets control of more or less the whole device.

Running an exe, the Windows loader does some pretty interesting black magic to map the different parts of the exe file into memory. The 'entry point' is calculated based on some values in structs near the beginning of the exe file. It's a pretty horrific mess in there TBH. You can read some about it here, but it may make you want to claw your eyes out. Basically the exe includes information about how big of a stack it needs and how big of a heap, etc. The loader gets it the pages it needs and then just sets eip to the entry point and lets it do what it wants. Sometimes it has to fiddle with some relative addresses within the file image in memory first, but VMM can usually eliminate that.

Just explain to me how I would, in assembly, instruct the computer to carry out two different threads at once.
How I would say something basic like "do A, B, and C in seperate caches.  when you've done A and B, do D."

That's more basic concurrency than threading. You'd mean 'cores' or possibly 'fibers' there rather than 'caches'. This is actually a better concurrency model than threading in many cases.

I'd really like to see what you're talking about implemented in C++, though I hear rumors that some compilers have custom features for it.

Starting an actual thread in ASM is no different than doing so in C/C++. You have to invoke the operating system's functions for it. I think what you may want in this situation is just a task queue that feeds into more than one thread for execution, but I suspect that you may not get the results you're looking for. The thing about the Windows thread scheduler is that Windows usually has a pretty large number of threads running all over the place. It's not just the program in the foreground that's getting scheduled. The kernel needs to do work, programs in the background are doing work, etc, etc. So even if you split up work using threads you're not really getting pure concurrency since those threads will basically execute when they damn well please. Your work won't necessarily be paralleled and you'll probably end up waiting on results when you don't need to.

In other words, you may be a little ahead of the state of the art there, but people are working on it. If anyone knows about these rumored compilers that offer keywords for concurrency I'd like to know about them as well.

Edited by Khatharr, 30 December 2012 - 11:42 PM.

void hurrrrrrrr() {__asm sub [ebp+4],5;}

There are ten kinds of people in this world: those who understand binary and those who don't.

### #16Tribad  Members   -  Reputation: 452

Like
0Likes
Like

Posted 31 December 2012 - 05:00 AM

I would suggest to read the intel documentation about their processors. There are a bunch of books some dealing with system programming because this is what you are asking here. From the history of computers and their individual operating systems there are a lot of solutions for handling multiple execution pathes. Some used segmented memory management schemes noadays you have paged memory management in most places. Because you have so many questions how things work in detail that I even suggest to learn how the old things worked. They started with more simple architectures and started to extend them step by step. This way your understanding grows the same way.

I have an older documentation available but also some very old books about intel i486 and Motorola 68000 CPUs. These are old but so trivial that they are easy to understand.

Edited by Tribad, 31 December 2012 - 05:28 AM.

### #17Codarki  Members   -  Reputation: 462

Like
0Likes
Like

Posted 31 December 2012 - 10:38 AM

Sorry to skim much of the answers here, so this might be redundant.

All threads can be running the same assembly code instructions simultaneously without a problem.

The code they're running will access some memory. This memory must be synchronized. Each thread has own registers for the code, but if the code shares the same memory, you must deal with it.

### #18Tribad  Members   -  Reputation: 452

Like
0Likes
Like

Posted 31 December 2012 - 11:13 AM

Running a couple of processes is the same management than running a couple of threads

The difference is that all threads of a process share the same data segment, except for thread local storage but that is another story, and can access and manipulate any other threads data.

This is why a synchronization mechanism must exist to not garble the data. Explained in this thread some posts ago.

The old way to get something like threads was to create a bunch of processes, each isolated in his own  address space, and than create a shared memory segment that is used to share data in a direct way.

The difference is that managing the synchronization between processes needs more processor cycles that synchronizing a bunch of threads. Threads are comfortable. If you understand how the memory is organized in code, data and stack and how the security mechanisms work it is an easy thing to understand threads.

Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

PARTNERS