A bunch of low level programming questions....

Started by
10 comments, last by Promit 15 years, 9 months ago
I just began doing a little reading on some low level programming topics. I am pretty unfamiliar with this topic and as a result I have a number of questions I was hoping you guys could help me with: (1) How do we ensure CPU cache coherency in the presence of context switching? For example lets say user program A executes for 10us, then user program B executes for 10us. What methods are used to prevent B from trashing the cache? (2) Each processor specification specifies a fixed number of registers. However CPUs that implement those specifications in fact have many more registers than specified. How do those registers end up being used if we can only reference the standard registers in our assembly programs? (3) How do CPUs respond to instructions they do not understand? For example if an executable which utilized SSE instructions was run on a CPU which doesn't support that instruction set, would a CPU exception be raised? (4) Does a CPU instruction necessarily take 1 cycle, or can a CPU instruction take multiple cycles? (5) When writing an inline assembly within a C or C++ program, how do we know which registers are in use? For example, what if some of the registers are being used to hold function arguments? Or similarly, what happens to the register state before and after the inline block is initiated? Thanks in advance :)
Advertisement
I don't know about the rest, but I can answer 4).

An x86 CPU instruction can take a variable number of clock cycles to complete. In addition, there's usually a specified amount of cycles before that instruction can be called again. For example, the FSQRT instruction (floating-point sqrt) on the Core 2 Duo has a latency of 58 cycles (meaning it takes 58 cycles to complete) and also a throughput of 58 (meaning 58 cycles must elapse before it can be called again). In comparison, an integer add (ADD) has a latency of 1 cycle, throughput of 0.5 cycles. Integer division (IDIV) has a latency and throughput of 22 cycles. You can find more information from Intel here.
NextWar: The Quest for Earth available now for Windows Phone 7.
Quote:(1) How do we ensure CPU cache coherency in the presence of context switching? For example lets say user program A executes for 10us, then user program B executes for 10us. What methods are used to prevent B from trashing the cache?

Make sure that the threads are working on the same dataset. [grin] Seriously though, I don't know if this is the case on the PC, but I remember when I was working on optimization for a console game recently, and there was a point in which our biggest performance bottleneck was two hardware threads thrashing the L1 cache like crazy, so I'm not sure if there's a practical way to avoid the cache incoherency due to multiple threads outside of, as I mentioned, having both threads operate on (approximately) the same data.
Quote:Original post by Sc4Freak
I don't know about the rest, but I can answer 4).

An x86 CPU instruction can take a variable number of clock cycles to complete. In addition, there's usually a specified amount of cycles before that instruction can be called again. For example, the FSQRT instruction (floating-point sqrt) on the Core 2 Duo has a latency of 58 cycles (meaning it takes 58 cycles to complete) and also a throughput of 58 (meaning 58 cycles must elapse before it can be called again). In comparison, an integer add (ADD) has a latency of 1 cycle, throughput of 0.5 cycles. Integer division (IDIV) has a latency and throughput of 22 cycles. You can find more information from Intel here.

Yeah I know off the top of my head that 4 is false at least for most cpu's since about 1998 with the introduction of superscalar:

A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor

The main reason I know this is due to the bit of Assembly programming I did a while back and how the book I was reading(don't remember which exactly) mentioned that all the tables in the back of the Microsoft MASM manual that show how many cycles each ASM instruction takes were pretty much obsolete nowadays!
Even nowadays you'll hear of some hardcore assembly programmers that brag how they had memorized how many cycles each instruction took. Not much use now is it?
But other than taking a computer architecure course you can probably figure most of this stuff out from the Intel or AMD manuals as sc4freak mentioned, which you can download or even order in paper format for free.
[size="2"]Don't talk about writing games, don't write design docs, don't spend your time on web boards. Sit in your house write 20 games when you complete them you will either want to do it the rest of your life or not * Andre Lamothe
Quote:Original post by fpsgamer
(1) How do we ensure CPU cache coherency in the presence of context switching? For example lets say user program A executes for 10us, then user program B executes for 10us. What methods are used to prevent B from trashing the cache?

The OS and the hardware take care of this for you. If you are writing operating system code at that level, there is a lot of processor documentation and graduate-level literature that you should read.

Quote:(2) Each processor specification specifies a fixed number of registers. However CPUs that implement those specifications in fact have many more registers than specified. How do those registers end up being used if we can only reference the standard registers in our assembly programs?
Here's the process since the Pentium 2: The hardware decodes your instructions into micro-ops and maps data to the internal registers. Then the out-of-order core executes the micro-ops and other instructions in an arbitrary internal order, and does assorted other processing magic. Eventually, the code is retired from the processor's core, which restores it to the same logical registers you expected it to be.

Intel has a bunch of literature on the subject over at softwarecollege.intel.com. There are many interactive demos, tons of source code, and lots of tutorials for specific concepts.

In addition to the tons of free resources, they have classes for a relatively low fee.

Quote:(3) How do CPUs respond to instructions they do not understand? For example if an executable which utilized SSE instructions was run on a CPU which doesn't support that instruction set, would a CPU exception be raised?
Depends on the CPU.

The x86 processors (in protected mode) use a low-level task manager, which includes various task states. Read intel's processor manuals for details.
Quote:(4) Does a CPU instruction necessarily take 1 cycle, or can a CPU instruction take multiple cycles?

No. Read their processor manuals for details on the times of various operations.

Quote:(5) When writing an inline assembly within a C or C++ program, how do we know which registers are in use? For example, what if some of the registers are being used to hold function arguments? Or similarly, what happens to the register state before and after the inline block is initiated?
When using inline asm blocks, it is the compiler's job to place the code into a a well-known state. That state is implementation defined.

Be aware that there is a cost for inline assembly. The compiler must:
* save assorted states
* ensure registers contain the expected values
* insert your code (without any optimizations)
* restore any states that were transparently saved and hidden

For short blocks of code there is generally a net loss of performance when using inline assembly, unless you are very careful to profile before and after you write the code.


Rather than use inline asm directly, you should consider using intrinsics if they are available for the operations. Many compilers use the same name as intel's compiler intrinsics, which are also documented in the processor manuals. These typically begin with __mm, __m64, __m128, or similar names.
Quote:Original post by daviangel
Quote:Original post by Sc4Freak...An x86 CPU instruction can take a variable number of clock cycles to complete...

Yeah I know off the top of my head that 4 is false at least for most cpu's since about 1998 with the introduction of superscalar
It has been that way since forever.

The first two computers I had in-depth assembly experience with, Z80 (8-bit computer) and TMS9900 (first 16-bit home computer), were both from the mid 1970s and had different execution times for different instructions.

There wasn't any sort of pipeline, so you could do lots of fun tricks with the cpu timings of instructions, such as repeatedly calling different operations for different length loops. Some of the later chips used short instruction caches, which let you do a few fun things with self-modifing code. The most common of those was to identify which processor version you were using by modifying an instruction a few bytes ahead from addition to subtraction. If your number went up, you knew you were on a newer processor with the longer instruction cache. You could modify several of them and look at the result of all the additions/subtractions to identify the exact cache length, which told you the exact model of the CPU.

Before that, (this was before my day, but I heard stories from the actual developers) you would use various CPU instructions as delays to wait for the storage drum's head to spin to the right position. Of course, I was fortunate enough to have dual floppy drives so I never had that problem. ;-)
Quote:Original post by fpsgamer
I just began doing a little reading on some low level programming topics. I am pretty unfamiliar with this topic and as a result I have a number of questions I was hoping you guys could help me with:


Get yourself a book if you're interested in this sort of thing. Computer Organization and Design and Computer Architecture both by Hennessy and Patterson are good ones.

Quote:Original post by fpsgamer
(1) How do we ensure CPU cache coherency in the presence of context switching? For example lets say user program A executes for 10us, then user program B executes for 10us. What methods are used to prevent B from trashing the cache?


B does trash the cache. On some processors B trashes the entire cache, i.e. context switches completely invalidates the cache. On others B only trashes what it uses, i.e. only the cache lines B touches are invalidated with respect to A.

Quote:Original post by fpsgamer
(2) Each processor specification specifies a fixed number of registers. However CPUs that implement those specifications in fact have many more registers than specified. How do those registers end up being used if we can only reference the standard registers in our assembly programs?


By using register renaming and variations on Tomasulo's algorithm.

Quote:Original post by fpsgamer
(3) How do CPUs respond to instructions they do not understand? For example if an executable which utilized SSE instructions was run on a CPU which doesn't support that instruction set, would a CPU exception be raised?


Yes, in general.

Quote:Original post by fpsgamer
(4) Does a CPU instruction necessarily take 1 cycle, or can a CPU instruction take multiple cycles?


Yes they can take multiple cycles, and often do. By using pipelined functional units you can get a throughput of 1 instruction per cycle, but that doesn't necessarily mean each instruction is executed in 1 cycle.

Quote:Original post by fpsgamer
(5) When writing an inline assembly within a C or C++ program, how do we know which registers are in use? For example, what if some of the registers are being used to hold function arguments? Or similarly, what happens to the register state before and after the inline block is initiated?


Either the compiler has to work around your register usage by identifying registers you use and generating code that saves and restores them around your block (MSVC), or you have to explicitly tell it which registers you kill (GCC, XL, others).
Computer Architecture: A quantitative approach is a pretty decent introductory book for graduate level architecture, and covers everything short of the subjects that are currently the topics of modern research papers.

In all honesty though, don't buy it unless you really want to get into architecture stuff [or even compiler programming] and even then buy it used. The information contained inside of it is largely useless for someone who wants even the thinnest abstraction between them and the metal. Nothing in this book will change how you program, but it'll explain what all the pixies, gnomes, and faeries are doing with your code once the compiler finishes up with it. The only effect it'll have on you is likely to expose just how complicated modern machine architectures are, and you'll never write your code with 'this will take 2 cycles, this will take 1 cycle' running through your head again [which is a practice that should be rattled from you if you've had any exposure to compilers as well].

With that said, if you ARE interested in it, it is a widely used book in computer science graduate level architecture courses, which means there are lots of used ones around.
Quote:Original post by frob
Quote:Original post by fpsgamer
(1) How do we ensure CPU cache coherency in the presence of context switching? For example lets say user program A executes for 10us, then user program B executes for 10us. What methods are used to prevent B from trashing the cache?

The OS and the hardware take care of this for you. If you are writing operating system code at that level, there is a lot of processor documentation and graduate-level literature that you should read.


The OS will indeed take care of this for you, but often poorly. For example if thread #1 and thread #2 are continuously updating adjacent elements array[1] and array[2], the OS and/or processor may take locks in the various cache levels to make sure that the writes are seen by all processors, even if they only touch their respective elements.

Related video of a presentation by Herb Sutter with some examples. (skip to around 1:18 for a specific example). And here's a DDJ article.

It's sad, but you really do need to know a little more about your target system to write efficient threaded code.
Thanks for the great replies so far :) In the meantime I managed to come up with a couple of new curiosities:

(1) What kind of addresses does the CPU cache store in the presence of virtual memory and many different user processes? I don't think it stores physical addresses because the correct page at that location may not be paged in. I don't think it can be virtual addresses because those are duplicate across every process.

(2) When a context switch occurs the operating system saves all the registers of the currently executing user process. AFIK the OS only stores the standard registers. What happens to the many CPU specific registers that may be in use at the time?

(3) Just the other day I took at shot at using intrinsics under GCC. However all the intrinsic functions I wrote performed worse than my high level C++ implementations. Was the problem due to my lack of knowledge, or are GCC's intrinsics not ready for prime time?

This topic is closed to new replies.

Advertisement