How does Multithreading Assembly work?

Started by
16 comments, last by Tribad 11 years, 4 months ago
Ah, good to know. I only had to worry about cache ops on my REDACTED when I was REDACTED it because the GPU used DMA, so if you swizzled a texture for faster GPU speeds (or otherwise modified it with the CPU) you had to call a writeback function on the specific memory range that you modified. Alternatively it had an uncached mirror of main memory starting at 0x80000000 IIRC.
void hurrrrrrrr() {__asm sub [ebp+4],5;}

There are ten kinds of people in this world: those who understand binary and those who don't.
The third one, Hyper-Threading, was invented by Intel and is a sort of fake version of the first one (multiple cores). When the CPU is only using a few of the pipelines described under instruction-level parallelism, and out-of-order execution can't fill the rest, we've got a lot of idle pipelines.

Because those idle pipelines are "almost" an extra core, Hyper-Threading kicks in and simulates an extra core to execute a parallel thread. But it is nowhere near the level of a real additional core, because: 1. not all parts of the pipeline are duplicated, and 2. the main thread may actually start using those unused pipelines, leaving no spare components to dedicate to the hyperthreaded thread.
Intel claims Hyperthreading yields an average gain of 15% in performance.
Btw, if you open Task Manager on a single-core system with Hyper-Threading, it will tell you there are two cores (the real one plus the fake one). On an Intel Core i7 with 8 cores, it will tell you there are 16 (8 real ones plus 8 fake ones).
Not all CPUs come with HT, in fact, most of them don't.
Just to add to this...the generic non-Intel term for this technique is "simultaneous multithreading", or "SMT" for short.
Correction: esp points to the last pushed value. When a value is pushed, esp is decreased first and then the value is written. When a value is popped, the value is read first and then esp is increased. I incorrectly stated that the increase/decrease would occur in a way that would leave esp pointing to invalid memory. Also, an interrupt only pushes some of the registers (including eflags). The interrupt routine is responsible for pushing the additional ones.

Thanks Everyone. This stuff is pretty complex.

Alright Khatharr, you've signed yourself up for some questions.

Is cache management, like flushing a specific variable, manual?
Is interruption manual?
By manual I mean: could a compiler insert an instruction to do this?

Or is it all automatic?

And the important question, the main question I've been wondering about.

A compiler can't build code that works outside the OS?
Once the OS boots, does it take control, or what?

So an exe for example, does it start with some general notion of where the code starts at?
And I'm assuming it starts at a single point and can only branch out by controlling the scheduler?

Just explain to me how I would, in assembly, instruct the computer to carry out two different threads at once.
How would I say something basic like "do A, B, and C in separate caches; when you've done A and B, do D."


If I could just understand how the computer knows what to multithread and how it communicates that, I'd be good.

But again, thanks everyone for the details. This will really help out my work!
It's crucial to my success to understand this, I just want you all to know.

Is cache management, like flushing a specific variable, manual?

It depends on the platform. You shouldn't have to mess with it on Windows, for instance. Platforms where it's useful will generally have API functions for managing it. It's only necessary when you have DMA going on and a DMA device may want to use memory that's cached on the CPU. It's actually quite rare, even in those situations. I think I only ever used it for the texture swizzle function and the texture hue-change function on the aforementioned device. I just had to put a writeback call at the end of those functions and everything worked fine.

Interrupts can be signaled from asm. The instruction is INT on x86. http://pdos.csail.mit.edu/6.828/2004/readings/i386/s02_06.htm

(Edit - If you pick a fight with the thread scheduler the skeleton man will eat you.)

With the virtual memory model programs are more or less compiled as though they have exclusive access to the machine. The thread scheduler interacts with them from the outside. It's sort of like an emulator save-state. If you have the register states recorded and the memory recorded then you can restore both and resume execution as if there had been no interruption. VMM allows this because it can easily swap out pages for different processes, so each process basically has its own memory within the span of real-address-mode memory (or the swap file, depending on memory conditions). Since the stack for a thread is stored in that thread's stack page the register states can be pushed to that stack and then the stack page can be swapped to that of the next thread and the registers popped. This restores both the memory and the register states of the upcoming thread, so it continues as if nothing had ever happened.

When a PC boots it has a specific location that it looks to, usually on the primary disk volume, to find a startup module. The BIOS handles getting everything out of bed and then does basic loading to get the ball rolling. After that the startup module loads the OS and everything snowballs into place. On other platforms (like REDACTED that I was mentioning) the platform only runs one program at a time and the OS is just the first program to start. When it wants to run a program it places the path of the program in a specific place in memory then does a sort of soft-boot and the indicated program gets control of more or less the whole device.

Running an exe, the Windows loader does some pretty interesting black magic to map the different parts of the exe file into memory. The 'entry point' is calculated based on some values in structs near the beginning of the exe file. It's a pretty horrific mess in there TBH. You can read some about it here, but it may make you want to claw your eyes out. Basically the exe includes information about how big of a stack it needs and how big of a heap, etc. The loader gets it the pages it needs and then just sets eip to the entry point and lets it do what it wants. Sometimes it has to fiddle with some relative addresses within the file image in memory first, but VMM can usually eliminate that.
Just explain to me how I would, in assembly, instruct the computer to carry out two different threads at once.
How would I say something basic like "do A, B, and C in separate caches; when you've done A and B, do D."

That's more basic concurrency than threading. You'd mean 'cores' or possibly 'fibers' there rather than 'caches'. This is actually a better concurrency model than threading in many cases.

I'd really like to see what you're talking about implemented in C++, though I hear rumors that some compilers have custom features for it.

Starting an actual thread in ASM is no different than doing so in C/C++. You have to invoke the operating system's functions for it. I think what you may want in this situation is just a task queue that feeds into more than one thread for execution, but I suspect that you may not get the results you're looking for. The thing about the Windows thread scheduler is that Windows usually has a pretty large number of threads running all over the place. It's not just the program in the foreground that's getting scheduled. The kernel needs to do work, programs in the background are doing work, etc, etc. So even if you split up work using threads you're not really getting pure concurrency since those threads will basically execute when they damn well please. Your work won't necessarily be paralleled and you'll probably end up waiting on results when you don't need to.

In other words, you may be a little ahead of the state of the art there, but people are working on it. If anyone knows about these rumored compilers that offer keywords for concurrency I'd like to know about them as well. biggrin.png

I would suggest reading the Intel documentation about their processors. There are a bunch of books, some dealing with system programming, because that is what you are asking about here. Over the history of computers and their operating systems there have been many solutions for handling multiple execution paths. Some used segmented memory management schemes; nowadays you have paged memory management in most places. Because you have so many questions about how things work in detail, I would even suggest learning how the old things worked. They started with simpler architectures and extended them step by step. That way your understanding grows the same way.

I have some older documentation available, and also some very old books about the Intel i486 and Motorola 68000 CPUs. They are old, but so simple that they are easy to understand.

Sorry, I only skimmed much of the answers here, so this might be redundant.

All threads can be running the same assembly code instructions simultaneously without a problem.

The code they're running will access some memory, and that memory must be synchronized. Each thread has its own registers for the code, but if the threads share the same memory, you must deal with it.

Managing a couple of processes works the same way as managing a couple of threads.

The difference is that all threads of a process share the same data segment (except for thread-local storage, but that is another story) and can access and manipulate any other thread's data.

This is why a synchronization mechanism must exist, so the data doesn't get garbled. This was explained in this thread some posts ago.

The old way to get something like threads was to create a bunch of processes, each isolated in its own address space, and then create a shared memory segment that is used to share data in a direct way.

The difference is that managing the synchronization between processes needs more processor cycles than synchronizing a bunch of threads. Threads are comfortable. If you understand how memory is organized into code, data and stack, and how the security mechanisms work, threads are an easy thing to understand.

This topic is closed to new replies.
