
### #ActualKhatharr

Posted 30 December 2012 - 03:27 PM

Ah, instead of trying to respond to each question individually I'll try to walk through threading on a single core. After that it's easy to see how additional cores fit in.

First let's talk about the general layout of a single-core CPU.

You have the ALU, which is the 'brain' that does the actual work, such as adding and subtracting. On each clock pulse the current instruction is decoded and its bits are used as a set of on/off gates that control which parts of the ALU get triggered. On bits pass current and off bits don't. That current passes through the ALU and causes some change in the internal state of the CPU based on which instruction is being processed. Attached to the ALU are the CPU's registers.

The CPU will have several registers, depending on what type it is. If a CPU is said to be 32-bit that means that each GPR (general purpose register) is 32 bits in length. A 64-bit CPU has 64-bit registers, etc. On a 32-bit x86 you have 4 registers to play with data: eax, ebx, ecx, and edx. They have different uses depending on the situation (which instruction you want to execute), but you can more or less load or store data to them as you please. In addition to these there are some other registers for special uses.

The next four are also considered GPRs and can be accessed directly with the 'mov' instruction:

ESP - extended stack pointer - points to the top of the stack

EBP - extended base pointer - points to the base of the current stack frame (the point at which the current function's stack space begins... sort of)

ESI/EDI - extended source/destination index - used for making memory transfers or other looping memory tasks more efficient. Some instructions use these to loop through memory in either a forward or backward direction. Can be very useful in such situations.

Special, non-GPR registers:

EFLAGS - This register is used as a set of bits that indicate the state of the CPU. Instructions can set the different bits or react to the state of them, but you can't read or write EFLAGS directly with 'mov' (you can get at it, but it takes some foolery, such as going through the stack with 'pushfd' and 'popfd').

EIP - extended instruction pointer - points to the memory location of the next instruction to be executed.
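To make the EFLAGS description a bit more concrete, here is a minimal sketch in C of how an 'add' instruction updates three of the status bits. The bit positions for the carry, zero and sign flags are the real x86 ones; the helper name `add_with_flags` is made up for this illustration, and the other flags (overflow, parity and so on) are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Real x86 bit positions of three EFLAGS status bits. */
#define FLAG_CF (1u << 0)   /* carry: the unsigned result did not fit in 32 bits */
#define FLAG_ZF (1u << 6)   /* zero: the result was 0 */
#define FLAG_SF (1u << 7)   /* sign: the top bit of the result is set */

/* Hypothetical helper: adds b to *reg the way 'add reg, b' would and
   returns the CF/ZF/SF bits that the instruction would leave in EFLAGS. */
uint32_t add_with_flags(uint32_t *reg, uint32_t b) {
    assert(reg != NULL);
    uint64_t wide = (uint64_t)*reg + b;     /* keep bit 32 so the carry is visible */
    uint32_t result = (uint32_t)wide;
    uint32_t flags = 0;
    if (wide >> 32)            flags |= FLAG_CF;
    if (result == 0)           flags |= FLAG_ZF;
    if (result & 0x80000000u)  flags |= FLAG_SF;
    *reg = result;
    return flags;
}
```

A conditional jump such as 'jz' or 'jc' is then just an instruction that looks at one of these bits to decide whether to change EIP.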

Segment registers:

These registers (CS, DS, SS, ES, FS and GS) are not as commonly discussed. Each one selects a 'segment', a region of memory with a base address, a size and access permissions, and every address the program uses is an offset within one of those segments: CS is used for code, SS for the stack and DS for ordinary data. Modern 32-bit operating systems mostly set the main segments up to cover the whole address space (the 'flat' model) and do their virtual memory management with paging instead, which is a separate mechanism: memory is divided into fixed-size 'pages' (4 KB by default on x86) that the OS maps and protects individually. Windows has a set of functions for managing pages, such as VirtualAlloc and VirtualProtect. Pages can have read/write/execute permissions set, and violating them raises an exception: good old 0xC0000005, the access violation, which Unix-like systems report as a segfault (segmentation fault, a name left over from the segment days).

There are also SIMD registers on most modern CPUs. These tend to be larger than the GPRs. The 64-bit MMX registers share their storage with the x87 floating-point registers, while SSE added separate 128-bit XMM registers that are also commonly used for ordinary floating-point math.

A simple ASM program to add 1 to a variable looks like this:

mov eax,variable ; load the value of 'variable' into the eax register

inc eax ; increment (increase by 1) the eax register

mov variable,eax ; store the value in the eax register in the memory belonging to 'variable'

This is where the cache would come in. When the CPU wants to read a value from memory it sends a request out over the memory bus. The request gets resolved, the memory is found and the value is sent back to the CPU. This can take hundreds of CPU cycles, so without help the CPU would spend most of its time waiting for values to arrive before it can continue. In order to solve this problem modern CPUs have an onboard cache. The cache keeps recently used memory close at hand, and the CPU also tries to predict what memory you're going to want and asks for it ahead of time. Then when you go to load the value it's (hopefully) already on hand and you don't have to wait for the bus. Changes to the value are typically held in the cache until that part of the cache is needed for something else, then they get written back to main memory.

On systems where more than one device has access to main memory this can cause problems, since the CPU can work on some data and then still have it cached: the other device may look in main memory and not see the changes that haven't been written back from the cache yet. On x86 the hardware keeps the caches of the different cores consistent with each other, but for other devices the OS or driver has to flush the cache or mark that memory as uncacheable. Note that the 'volatile' keyword in C/C++ does not control the cache. It tells the compiler not to keep the value in a register or optimize accesses away, so every read and write in the source becomes a real load or store instruction. The instruction stream typically doesn't need to know anything about what the cache is doing except in those cases. The cache more or less takes care of itself. (The prefetcher watches the pattern of memory accesses in order to try and decide which instructions/data should be fetched ahead of time.)
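As a small illustration of what 'volatile' actually does at the language level: it forces the compiler to re-read a variable from memory on every access instead of keeping a copy in a register. One place where standard C needs this is a flag shared between normal code and a signal handler. The sketch below uses only the standard `signal` and `raise` functions; the names `stop_requested`, `on_signal` and `wait_for_signal` are invented for the example.

```c
#include <assert.h>
#include <signal.h>

/* 'volatile' makes the compiler reload this flag on every loop iteration.
   Without it the compiler could read the flag once and spin forever. */
static volatile sig_atomic_t stop_requested = 0;

static void on_signal(int signum) {
    (void)signum;
    stop_requested = 1;   /* written from outside the normal flow of the loop */
}

/* Spins until the handler sets the flag and returns how many iterations ran.
   To keep the example self-contained the loop raises the signal itself
   on its 1000th iteration. */
int wait_for_signal(void) {
    int spins = 0;
    void (*previous)(int) = signal(SIGINT, on_signal);
    assert(previous != SIG_ERR);
    while (!stop_requested) {
        if (++spins == 1000)
            raise(SIGINT);   /* runs on_signal before returning */
    }
    return spins;
}
```

Note that this says nothing about the CPU cache: the loads and stores the compiler emits still go through the cache like any others.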

A brief look at the stack comes next. To enter a function you typically use the 'call' instruction and give it an address:

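#include <stdio.h>

void myFunc(char* message) {
    printf("%s", message);
}

int main() {
    char myStr[] = "Hello.";
    __asm {                 // MSVC inline assembly, 32-bit x86 builds only
        lea eax, myStr      // myStr is an array, so load its address
        push eax            // push the argument
        call myFunc
        add esp, 4          // cdecl: the caller removes the argument afterwards
    }
    return 0;
}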

When a process starts, a section of memory is allocated for its stack. The base pointer and stack pointer are set to the end of that memory, since the stack grows downward. When a value is 'pushed' the stack pointer is first reduced to make room, then the value gets written to the location the stack pointer now points to. So if I say 'push eax', ESP is reduced by 4 (meaning 4 bytes) and then the value in eax gets written to the address pointed to by ESP. When I say 'pop eax' the value ESP points to is placed in eax and then the stack pointer is increased by 4. Memory which is at an address lower than ESP is considered 'invalid'. It can contain anything at all and should not be referred to unless you're doing something weird.
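Here is a minimal sketch of push and pop in C, using a small byte array in place of real stack memory. On x86, 'push' lowers ESP by 4 and then writes at the new address, and 'pop' reads at ESP and then raises it by 4; the code follows that order. The type `ToyStack` and the `toy_*` function names are hypothetical, and `esp` here is an offset into the array rather than a real address.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define STACK_BYTES 64

typedef struct {
    uint8_t  mem[STACK_BYTES];  /* the memory reserved for the stack */
    uint32_t esp;               /* offset into mem, standing in for ESP */
} ToyStack;

/* An empty stack starts with ESP at the END of its memory. */
void toy_init(ToyStack *s) {
    s->esp = STACK_BYTES;
}

/* push: lower ESP by 4 first, then store the value at the new ESP. */
void toy_push(ToyStack *s, uint32_t value) {
    assert(s->esp >= 4);                    /* otherwise: stack overflow */
    s->esp -= 4;
    memcpy(&s->mem[s->esp], &value, 4);
}

/* pop: load the value at ESP, then raise ESP by 4. */
uint32_t toy_pop(ToyStack *s) {
    uint32_t value;
    assert(s->esp + 4 <= STACK_BYTES);      /* otherwise: nothing to pop */
    memcpy(&value, &s->mem[s->esp], 4);
    s->esp += 4;
    return value;
}
```

After a pop the old bytes are still sitting in `mem` below `esp`, which is exactly the 'invalid' region: nothing erases it, it just isn't yours anymore.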

To call a function, first the arguments are pushed in reverse order, then the 'call' instruction is used. 'call' does the following:

push EIP to the stack (store the address of the next instruction - the one after the 'call')

jump to the label/address provided as an argument to 'call' (in the example it was the label 'myFunc')

The function called will usually set up its 'stack frame' like so:

push ebp (push the current base pointer to the stack)

mov ebp,esp (set the base pointer to point to the current stack pointer position)

sub esp,??? (where ??? is the total size of the function's local variables)

So what just happened there?  Well, ebp gets pushed. This gives us a value to set it back to when we exit the function.

Then we set ebp's new value to the current stack position. This means that ebp now points to the previous frame's ebp. If we need to unwind the stack we've got a chain of pointers that all point right down the chain to our original ebp.

After that we subtract the local storage size from esp, meaning that we've reserved space for our local variables on the stack. The compiler knows which address in that space belongs to which variable.

At this point the following is true:

• The function arguments begin at ebp+8 and increase upward (since we pushed them in reverse order prior to 'call')
• Local variable storage begins at ebp-4 and increases downward
• Local storage is not necessarily in the order of its C declaration. The compiler is free to order things how it wants within the stack frame.
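The frame layout above can be checked with a small simulation. This is only a sketch: `ToyCpu`, `toy_reset`, `push32`, `load32` and `enter_frame` are names invented here, addresses are offsets into a byte array, and the function is assumed to take two 4-byte arguments and reserve 8 bytes of locals.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MEM_BYTES 128

typedef struct {
    uint8_t  mem[MEM_BYTES];  /* toy stack memory */
    uint32_t esp, ebp;        /* offsets into mem, standing in for ESP and EBP */
} ToyCpu;

void toy_reset(ToyCpu *c) {
    memset(c->mem, 0, sizeof c->mem);
    c->esp = c->ebp = MEM_BYTES;   /* empty stack: both point at the end */
}

void push32(ToyCpu *c, uint32_t v) {
    assert(c->esp >= 4);
    c->esp -= 4;
    memcpy(&c->mem[c->esp], &v, 4);
}

uint32_t load32(const ToyCpu *c, uint32_t addr) {
    uint32_t v;
    assert(addr + 4 <= MEM_BYTES);
    memcpy(&v, &c->mem[addr], 4);
    return v;
}

/* Everything that happens between the caller's first push and the
   first real instruction of the called function. */
void enter_frame(ToyCpu *c, uint32_t arg1, uint32_t arg2, uint32_t return_addr) {
    push32(c, arg2);          /* caller: arguments in reverse order */
    push32(c, arg1);
    push32(c, return_addr);   /* 'call' pushes the return address */
    push32(c, c->ebp);        /* callee: push ebp */
    c->ebp = c->esp;          /* callee: mov ebp, esp */
    c->esp -= 8;              /* callee: sub esp, 8 */
}
```

After `enter_frame`, `ebp+0` holds the caller's saved ebp, `ebp+4` holds the return address, and the first argument sits at `ebp+8`, which is why arguments start at that offset.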

Once the function is done it wants to return control to the caller. It does the following:

If there's a return value then place it in the eax register

Move the value in ebp to esp - this puts the stack pointer back at the point it was at when we entered the function, effectively freeing the stack frame's memory

Pop the stack into ebp - restore ebp to the value it had prior to 'call'. Now it points to the base of the caller's stack frame again

'ret' - this instruction basically pops into EIP, moving the instruction pointer to the position of the instruction after our original 'call'

At this point the arguments that we pushed prior to 'call' would still be on the stack. There's two ways to handle this:

Provide a numeric argument to 'ret' - This value will be added to ESP after EIP is popped, effectively freeing the pushed args. (This is what the stdcall convention does.)

Just manually add the length of the args to ESP after the function returns. - This has to be done in cases where the length of the args is unknown to the function, such as in the case of printf, which can have a variable number of arguments. (This is what the cdecl convention does.)
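Both cleanup styles can be compared in a toy simulation. Again this is a sketch under assumptions: the names `ToyCpu`, `push32`, `pop32`, `leave_and_ret` and `simulate_call` are invented, the call passes two 4-byte arguments, and 0x401000 is an arbitrary stand-in for the return address.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MEM_BYTES 128

typedef struct {
    uint8_t  mem[MEM_BYTES];
    uint32_t esp, ebp, eip;   /* toy registers; esp/ebp are offsets into mem */
} ToyCpu;

void push32(ToyCpu *c, uint32_t v) {
    assert(c->esp >= 4);
    c->esp -= 4;
    memcpy(&c->mem[c->esp], &v, 4);
}

uint32_t pop32(ToyCpu *c) {
    uint32_t v;
    assert(c->esp + 4 <= MEM_BYTES);
    memcpy(&v, &c->mem[c->esp], 4);
    c->esp += 4;
    return v;
}

/* Function exit: 'mov esp, ebp', 'pop ebp', then 'ret N' with N = callee_pops. */
void leave_and_ret(ToyCpu *c, uint32_t callee_pops) {
    c->esp = c->ebp;          /* mov esp, ebp: drop the locals */
    c->ebp = pop32(c);        /* pop ebp: restore the caller's frame base */
    c->eip = pop32(c);        /* ret: pop the return address into EIP */
    c->esp += callee_pops;    /* ret N: also discard N bytes of arguments */
}

/* Runs one complete call and returns ESP as the caller sees it afterwards. */
uint32_t simulate_call(int callee_cleans_up) {
    ToyCpu c;
    memset(&c, 0, sizeof c);
    c.esp = c.ebp = MEM_BYTES;
    push32(&c, 20);                       /* arguments, reverse order */
    push32(&c, 10);
    push32(&c, 0x401000);                 /* 'call' pushes the return address */
    push32(&c, c.ebp);                    /* prologue: push ebp */
    c.ebp = c.esp;                        /*           mov ebp, esp */
    c.esp -= 8;                           /*           sub esp, 8 */
    leave_and_ret(&c, callee_cleans_up ? 8 : 0);
    if (!callee_cleans_up)
        c.esp += 8;                       /* caller does 'add esp, 8' */
    assert(c.eip == 0x401000);            /* execution resumes after the 'call' */
    return c.esp;
}
```

Either way ESP ends up exactly where it was before the first argument was pushed. If both sides clean up, or neither does, the stack drifts by 8 bytes per call, which is the classic symptom of a calling-convention mismatch.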

Okay, so that's very complicated unless you just sit there and watch it in action for a while in the debugger.

Now we're almost to threading. The next concept is interrupts. A program (such as the OS kernel, which contains the thread scheduler) can set up a series of instructions in memory and then tell the CPU to register the start address there as an interrupt routine. The routine is matched to an interrupt number and remembered until the CPU is told otherwise. When the CPU gets an interrupt signal it will have a number accompanying it. The CPU automatically saves EFLAGS, the code segment and EIP on the stack, then transfers control to the routine matching the interrupt number it just got. The routine itself saves any other registers it's going to use. When the interrupt routine is done it restores those registers and executes 'iret', which restores the rest, and the program resumes.
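The number-to-routine matching can be pictured as a table of function pointers. The sketch below is a plain C analogy, not how you would really program the CPU's interrupt table: `vector_table`, `register_routine`, `raise_interrupt`, `timer_routine` and `tick_count` are all names made up here, and the saving and restoring of registers is left out. The table size of 256 matches the number of interrupt numbers on x86.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_VECTORS 256   /* x86 has 256 interrupt numbers */

typedef void (*InterruptRoutine)(void);

/* One slot per interrupt number; static storage starts out all NULL. */
static InterruptRoutine vector_table[NUM_VECTORS];

int tick_count = 0;

/* Example routine: just counts how often it was triggered. */
void timer_routine(void) {
    tick_count++;
}

/* "Tell the CPU" which routine belongs to which interrupt number. */
void register_routine(uint8_t number, InterruptRoutine routine) {
    assert(routine != NULL);
    vector_table[number] = routine;
}

/* An interrupt arrives with its number: look up the routine and run it.
   Returns 1 if a routine was registered for that number, 0 otherwise. */
int raise_interrupt(uint8_t number) {
    InterruptRoutine routine = vector_table[number];
    if (routine == NULL)
        return 0;
    routine();
    return 1;
}
```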

So you may now have a notion of how a thread scheduler will work. A routine is set up as an interrupt routine, but instead of returning to the interrupted thread it saves the register state and then returns to the state saved from the thread that's scheduled next. Interrupts can also be set to trigger on timers, which makes threading easier to implement.
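The save-and-swap idea fits in a few lines of C. This is a toy round-robin sketch under stated assumptions: `Registers`, `ToyScheduler` and `timer_interrupt` are hypothetical names, the 'CPU' is just a struct, and a real scheduler also has to deal with priorities, blocked threads, separate stacks per thread and much more.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_THREADS 4

typedef struct {
    uint32_t eax, ebx, ecx, edx, esi, edi, ebp, esp, eip, eflags;
} Registers;

typedef struct {
    Registers cpu;                 /* the live register state of the core */
    Registers saved[MAX_THREADS];  /* one saved state per thread */
    int count;                     /* number of threads in use */
    int current;                   /* index of the thread currently running */
} ToyScheduler;

/* What the timer interrupt routine does: instead of returning to the
   interrupted thread, save its registers and load the next thread's.
   Returns the index of the thread that runs next. */
int timer_interrupt(ToyScheduler *s) {
    assert(s->count > 0 && s->count <= MAX_THREADS);
    s->saved[s->current] = s->cpu;              /* save the interrupted thread */
    s->current = (s->current + 1) % s->count;   /* round-robin: pick the next one */
    s->cpu = s->saved[s->current];              /* restore its registers, EIP included */
    return s->current;
}
```

Because EIP and ESP are part of the saved state, each thread picks up at exactly the instruction where it was interrupted, with its own stack, and never notices that it was paused.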

In the case of a multi-core system the scheduler just has the option of choosing which core to interrupt and give the next thread to. Apart from memory access concerns the cores don't really need to know anything about one another. They just keep executing the instruction at EIP until they get an interrupt.

All that's somewhat simplified, though. There's a lot of tiny details and craziness that goes on in there that can keep you busy for many sleepless nights.

Still, it's fun, right?
