  • 05/16/13 03:09 PM

    A Journey Through the CPU Pipeline

    General and Gameplay Programming

    • Posted By frob
    It is good for programmers to understand what goes on inside a processor. The CPU is at the heart of our career. What goes on inside the CPU? How long does it take for one instruction to run? What does it mean when a new CPU has a 12-stage pipeline, or 18-stage pipeline, or even a "deep" 31-stage pipeline? Programs generally treat the CPU as a black box. Instructions go into the box in order, instructions come out of the box in order, and some processing magic happens inside. As a programmer, it is useful to learn what happens inside the box. This is especially true if you will be working on tasks like program optimization. If you don't know what is going on inside the CPU, how can you optimize for it? This article is about what goes on inside the x86 processor's deep pipeline.

    Stuff You Should Already Know

    First, this article assumes you know a bit about programming, and maybe even a little assembly language. If you don't know what I mean when I mention an instruction pointer, this article probably isn't for you. When I talk about registers, instructions, and caches, I assume you already know what they mean, can figure it out, or will look it up. Second, this article is a simplification of a complex topic. If you feel I have left out important details, please add them to the comments at the end. Third, I am focusing on Intel processors and the x86 family. I know there are many different processor families out there other than x86. I know that AMD introduced many useful features into the x86 family and Intel incorporated them. It is Intel's architecture and Intel's instruction set, and Intel introduced the most major feature being covered, so for simplicity and consistency I'm just going to stick with their processors. Fourth, this article is already out of date. Newer processors are in the works and some are due out in a few months. I am very happy that technology is advancing at a rapid pace. I hope that someday all of these steps are completely outdated, replaced with even more amazing advances in computing power.

    The Pipeline Basics

    From an extremely broad perspective the x86 processor family has not changed very much over its 35 year history. There have been many additions but the original design (and nearly all of the original instruction set) is basically intact and visible in the modern processor. The original 8086 processor has 14 CPU registers which are still in use today. Four are general purpose registers -- AX, BX, CX, and DX. Four are segment registers that are used to help with pointers -- Code Segment (CS), Data Segment (DS), Extra Segment (ES), and Stack Segment (SS). Four are index registers that point to various memory locations -- Source Index (SI), Destination Index (DI), Base Pointer (BP), and Stack Pointer (SP). One register contains bit flags. And finally, there is the most important register for this article: The Instruction Pointer (IP). The instruction pointer register is a pointer with a special job. The instruction pointer's job is to point to the next instruction to be run. All processors in the x86 family follow the same pattern. First, they follow the instruction pointer and decode the next CPU instruction at that location. After decoding, there is an execute stage where the instruction is run. Some instructions read from memory or write to it, others perform calculations or comparisons or do other work. When the work is done, the instruction goes through a retire stage and the instruction pointer is modified to point to the next instruction. This decode, execute, and retire pipeline pattern applies to the original 8086 processor as much as it applies to the latest Core i7 processor. Additional pipeline stages have been added over the years, but the pattern remains.

    What Has Changed Over 35 Years

    The original processor was simple by today's standard. The original 8086 processor began by evaluating the instruction at the current instruction pointer, decoded it, executed it, retired it, and moved on to the next instruction that the instruction pointer pointed to. Each new chip in the family added new functionality. Most chips added new instructions. Some chips added new registers. For the purposes of this article I am focusing on the changes that affect the main flow of instructions through the CPU. Other changes like adding virtual memory or parallel processing are certainly interesting and useful, but not applicable to this article. In 1982 an instruction cache was added to the processor. Instead of jumping out to memory at every instruction, the CPU would read several bytes beyond the current instruction pointer. The instruction cache was only a few bytes in size, just large enough to fetch a few instructions, but it dramatically improved performance by removing round trips to memory every few cycles. In 1985, the 386 added cache memory for data as well as expanding the instruction cache. This gave performance improvements by reading several bytes beyond a data request. By this point both the instruction cache and data cache were measured in kilobytes rather than bytes. In 1989, the i486 moved to a five-stage pipeline. Instead of having a single instruction inside the CPU, each stage of the pipeline could have an instruction in it. This addition more than doubled the performance of a 386 processor of the same clock rate. The fetch stage extracted an instruction from the cache. (The instruction cache at this time was generally 8 kilobytes.) The second stage would decode the instruction. The third stage would translate memory addresses and displacements needed for the instruction. The fourth stage would execute the instruction. The fifth stage would retire the instruction, writing the results back to registers and memory as needed. 
By allowing multiple instructions in the processor at once, programs could run much faster. 1993 saw the introduction of the Pentium processor. The processor family changed from numbers to names as a result of a lawsuit--that's why it is Pentium instead of the 586. The Pentium chip changed the pipeline even more than the i486. The Pentium architecture added a second separate superscalar pipeline. The main pipeline worked like the i486 pipeline, but the second pipeline ran some simpler instructions, such as direct integer math, in parallel and much faster. In 1995, Intel released the Pentium Pro processor. This was a radically different processor design. This chip had several features including out-of-order execution processing core (OOO core) and speculative execution. The pipeline was expanded to 12 stages, and it included something termed a 'superpipeline' where many instructions could be processed simultaneously. This OOO core will be covered in depth later in the article. There were many major changes between 1995 when the OOO core was introduced and 2002 when our next date appears. Additional registers were added. Instructions that processed multiple values at once (Single Instruction Multiple Data, or SIMD) were introduced. Caches were introduced and existing caches enlarged. Pipeline stages were sometimes split and sometimes consolidated to allow better use in real-world situations. These and other changes were important for overall performance, but they don't really matter very much when it comes to the flow of data through the chip. In 2002, the Pentium 4 processor introduced a technology called Hyper-Threading. The OOO core was so successful at improving processing flow that it was able to process instructions faster than they could be sent to the core. For most users the CPU's OOO core was effectively idle much of the time, even under load. To help give a steady flow of instructions to the OOO core they attached a second front-end. 
The operating system would see two processors rather than one. There were two sets of x86 registers. There were two instruction decoders that looked at two sets of instruction pointers and processed both sets of results. The results were processed by a single, shared OOO core but this was invisible to the programs. Then the results were retired just like before, and the instructions were sent back to the two virtual processors they came from. In 2006, Intel released the "Core" microarchitecture. For branding purposes, it was called "Core 2" (because everyone knows two is better than one). In a somewhat surprising move, CPU clock rates were reduced and Hyper-Threading was removed. By slowing down the clock they could expand all the pipeline stages. The OOO core was expanded. Caches and buffers were enlarged. Processors were re-designed focusing on dual-core and quad-core chips with shared caches. In 2008, Intel went with a naming scheme of Core i3, Core i5, and Core i7. These processors re-introduced Hyper-Threading with a shared OOO core. The three different processors differed mainly by the size of the internal caches. Future Processors: The next microarchitecture update is currently named Haswell and speculation says it will be released late in 2013. So far the published docs suggest it is a 14-stage OOO core pipeline, so it is likely the data flow will still follow the basic design of the Pentium Pro. So what is all this pipeline stuff, what is the OOO core, and how does it help processing speed?

    CPU Instruction Pipelines

    In the most basic form described above, a single instruction goes in, gets processed, and comes out the other side. That is fairly intuitive for most programmers. The i486 has a 5-stage pipeline. The stages are - Fetch, D1 (main decode), D2 (secondary decode, also called translate), EX (execute), WB (write back to registers and memory). One instruction can be in each stage of the pipeline.
    pipeline_superscalar.PNG
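    As a rough illustration of the five stages listed above, here is a minimal Python sketch of an idealized pipeline. It assumes every instruction spends exactly one cycle in each stage and never waits on another instruction, which is a simplification. The names `pipeline_schedule` and `print_schedule` are made up for this example.

```python
STAGES = ["Fetch", "D1", "D2", "EX", "WB"]


def pipeline_schedule(instructions):
    """Return, for each clock cycle, which instruction sits in each stage.

    Idealized model (an assumption): one cycle per stage, no stalls.
    """
    total_cycles = len(instructions) + len(STAGES) - 1
    schedule = []
    for cycle in range(total_cycles):
        row = {}
        for stage_index, stage in enumerate(STAGES):
            # The instruction that entered Fetch `stage_index` cycles ago
            # has now reached this stage.
            instr_index = cycle - stage_index
            if 0 <= instr_index < len(instructions):
                row[stage] = instructions[instr_index]
            else:
                row[stage] = None
        schedule.append(row)
    return schedule


def print_schedule(schedule):
    """Print one row per cycle, with '-' for an empty stage."""
    print("cycle  " + "  ".join(f"{s:<6}" for s in STAGES))
    for cycle, row in enumerate(schedule, start=1):
        cells = "  ".join(f"{(row[s] or '-'):<6}" for s in STAGES)
        print(f"{cycle:<5}  {cells}")


if __name__ == "__main__":
    print_schedule(pipeline_schedule(["i1", "i2", "i3", "i4"]))
```

    In this model four instructions finish in 8 cycles instead of the 20 a one-at-a-time design would need, because up to five instructions are in flight at once.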
    There is a major drawback to a CPU pipeline like this. Imagine the code below. Back before CPU pipelines the following three lines were a common way to swap two variables in place.

        XOR a, b
        XOR b, a
        XOR a, b

    The chips from the 8086 up through the 386 did not have an internal pipeline. They processed only a single instruction at a time, independently and completely. Three consecutive XOR instructions are not a problem in this architecture. We'll consider what happens in the i486 since it was the first x86 chip with an internal pipeline. It can be a little confusing to watch many things in motion at once, so you may want to refer back to the diagram above. The first instruction enters the Fetch stage and we are done with that step. On the next step the first instruction moves to the D1 stage (main decode) and the second instruction is brought into the Fetch stage. On the third step the first instruction moves to D2, the second instruction moves to D1, and the third is fetched. On the next step something goes wrong. The first instruction moves to EX ... but the other instructions do not advance. The decoder stops because the second XOR instruction requires the result of the first instruction. The variable (a) is supposed to be used by the second instruction, but it won't be written to until the first instruction is done. So the instructions in the pipeline wait until the first instruction works its way through the EX and WB stages. Only after the first instruction is complete can the second instruction make its way through the pipeline. The third instruction will similarly get stuck, waiting for the second instruction to complete. This is called a pipeline stall or a pipeline bubble. Another issue with pipelines is that some instructions execute very quickly while others execute very slowly. This was made more visible with the Pentium's dual-pipeline system. The Pentium Pro introduced a 12-stage pipeline.
When that number was first announced there was a collective gasp from programmers who understood how the superscalar pipeline worked. If Intel followed the same design with a 12-stage superscalar pipeline then a pipeline stall or slow instruction would seriously harm execution speed. At the same time they announced a radically different internal pipeline, calling it the Out Of Order (OOO) core. It was difficult to understand from the documentation, but Intel assured developers that they would be thrilled with the results. Let's have a look at this OOO core pipeline in more depth.
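    Before moving on, the stall from the three-XOR example can be made concrete with a toy timing model. This sketch follows the simplified rule used in this article: a dependent instruction may not enter EX until the instruction producing its input has finished WB. Real chips use extra hardware to shorten that wait, so treat the cycle counts as illustrative only. The function name `run_in_order` and the tuple layout are assumptions made for this example.

```python
def run_in_order(program, pipeline_depth=5):
    """Return the cycle in which each instruction completes write-back.

    `program` is a list of (dest, sources) tuples in program order.
    Assumed model: EX is the second-to-last stage, WB is the last, and an
    instruction cannot enter EX until every source register it reads has
    been written back by the earlier instruction that produces it.
    """
    last_writer_wb = {}  # register -> cycle its newest value is written back
    wb_cycles = []
    prev_ex = 0
    ex_stage = pipeline_depth - 1
    for index, (dest, sources) in enumerate(program):
        earliest = index + ex_stage   # arrival at EX if nothing ever stalls
        after_prev = prev_ex + 1      # in-order: cannot pass the predecessor
        data_ready = max(
            (last_writer_wb.get(reg, 0) + 1 for reg in sources), default=0
        )
        ex = max(earliest, after_prev, data_ready)
        wb = ex + 1
        last_writer_wb[dest] = wb
        wb_cycles.append(wb)
        prev_ex = ex
    return wb_cycles


if __name__ == "__main__":
    # XOR a, b / XOR b, a / XOR a, b: each reads what the previous one wrote.
    xor_swap = [("a", ("a", "b")), ("b", ("b", "a")), ("a", ("a", "b"))]
    # Three instructions that touch unrelated registers.
    independent = [("a", ("a", "b")), ("c", ("c", "d")), ("e", ("e", "f"))]
    print("dependent  :", run_in_order(xor_swap))     # [5, 7, 9]
    print("independent:", run_in_order(independent))  # [5, 6, 7]
```

    Under these assumptions the dependent sequence finishes two cycles later than three unrelated instructions would, and the gap grows with every additional dependent instruction and every additional pipeline stage.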

    The Out Of Order Core Pipeline

    The OOO Core pipeline is a case where a picture is worth a thousand words. So let's get some pictures.

    Diagrams of CPU Pipelines

    The i486 had a 5-stage pipeline that worked well. The idea was very common in other processor families and works well in the real world.
    pipeline_486.PNG
    The Pentium pipeline was even better than the i486. It had two instruction pipelines that could run in parallel, and each pipeline could have multiple instructions in different stages. You could have nearly twice as many instructions being processed at the same time.
    pipeline_586.PNG
    Having fast instructions waiting for slow instructions was still a problem with parallel pipelines. Having sequential instruction order was another issue thanks to stalls. The pipelines are still linear and can face a performance barrier that cannot be breached. The OOO core is a huge departure from the previous chip designs with their linear paths. It added some complexity and introduced nonlinear paths:
    pipeline_OOO.PNG
    The first thing that happens is that instructions are fetched from memory into the processor's instruction cache. The decoder on modern processors can detect when a large branch is about to happen (such as a function call) and can begin loading the instructions before they are needed. The decoding stage was modified slightly from earlier chips. Instead of just processing a single instruction at the instruction pointer, the Pentium Pro processor could decode up to three instructions per cycle. Today's (circa 2008-2013) processors can decode up to four instructions at once. Decoding produces small fragments of operations called micro-ops or u-ops. Next is a stage (or set of stages) called micro-op translation, followed by register aliasing. Many operations are going on at once and we will potentially be doing work out of order, so an instruction could read from a register at the same time another instruction is writing to it. Writing to a register could potentially stomp on a value that another instruction needs. Inside the processor the original registers (such as AX, BX, CX, DX, and so on) are translated (or aliased) into internal registers that are hidden from the programmer. The registers and memory addresses need to have their values mapped to a temporary value for processing. Currently 4 micro-ops can go through translation every cycle. After micro-op translation is complete, all of the instruction's micro-ops enter a reorder buffer, or ROB. The ROB currently holds up to 128 micro-ops. On a processor with Hyper-Threading the ROB can also coordinate entries from multiple virtual processors. Both virtual processors come together into a single OOO core at the ROB stage. These micro-ops are now ready for processing. They are placed in the Reservation Station (RS). The RS currently can hold 36 micro-ops at any one time. Now the magic of the OOO core happens. 
The micro-ops are processed simultaneously on multiple execution units, and each execution unit runs as fast as it can. Micro-ops can be processed out of order as long as their data is ready, sometimes skipping over unready micro-ops for a long time while working on other micro-ops that are ready. This way a long operation does not block quick operations and the cost of pipeline stalls is greatly reduced. The original Pentium Pro OOO core had six execution units: two integer processors, one floating-point processor, a load unit, a store address unit, and a store data unit. The two integer processors were specialized; one could handle the complex integer operations, the other could solve up to two simple operations at once. In an ideal situation the Pentium Pro OOO Core execution units could process seven micro-ops in a single clock cycle. Today's OOO core still has six execution units. It still has the load address, store address, and store data execution units; the other three have changed somewhat. Each of those three can perform basic math operations or a more complex micro-op, and each is specialized for different micro-ops, allowing it to complete the work faster than a general-purpose unit would. In an ideal situation today's OOO core can process 11 micro-ops in a single cycle. Eventually the micro-op is run. It goes through a few more small stages (which vary from processor to processor) and eventually gets retired. At this point it is returned to the outside world and the instruction pointer is advanced. From the program's point of view the instruction has simply entered the CPU and exited the other side in exactly the same way it did back on the old 8086. If you were following carefully you may have noticed one very important issue in the way it was just described. What happens if there is a change in execution location? 
For example, what happens when the code hits an 'if' statement or a 'switch' statement? On the older processors this meant discarding the work in the superscalar pipeline and waiting for the new branch to begin processing. A pipeline stall when the CPU holds one hundred instructions or more is an extreme performance penalty. Every instruction needs to wait while the instructions at the new location are loaded and the pipeline restarted. In this situation the OOO core needs to cancel work in progress, roll back to the earlier state, wait until all the micro-ops are retired, discard them and their results, and then continue at the new location. This was a very difficult problem for the design, and it came up frequently. The performance in this situation was unacceptable to the engineers. This is where the other major feature of the OOO core comes in. Speculative execution was their answer. Speculative execution means that when a conditional statement (such as an 'if' block) is encountered the OOO core does not wait to learn which way the branch goes. It guesses the likely direction and simply keeps decoding and running the code along that path. As soon as the core figures out which branch was the correct one, any results from a wrongly guessed path are discarded. This prevents the stall at the small cost of sometimes running the code inside the wrong branch. The CPU designers also included a branch prediction cache, which further improved the results when the core was forced to guess among multiple branch locations. We still have CPU stalls from this problem, but the solutions in place have reduced it to the point where it is a rare exception rather than a usual condition. Finally, CPUs with Hyper-Threading enabled will expose two virtual processors for a single shared OOO core. They share a Reorder Buffer and OOO core, appearing as two separate processors to the operating system. That looks like this:
    pipeline_OOO_HT.PNG
    A processor with Hyper-Threading gives two virtual processors, which in turn feed more instructions to the OOO core. This gives a performance increase during general workloads. A few compute-intensive workloads that are written to take advantage of every processor can saturate the OOO core. In those situations Hyper-Threading can slightly decrease overall performance. Those workloads are relatively rare; for everyday computer tasks Hyper-Threading usually gives consumers a noticeable speedup, though well short of double, because both virtual processors share a single OOO core.
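    Returning to branch prediction for a moment: the article does not say how the branch prediction cache makes its guesses, and real predictors are far more elaborate than anything shown here. As a hedged illustration only, the sketch below uses a classic textbook scheme, a single 2-bit saturating counter, to show why branches that behave the same way every time are cheap and erratic ones are not. The function name `mispredictions` is made up for this example.

```python
import random


def mispredictions(outcomes):
    """Count wrong guesses made by one 2-bit saturating counter.

    Counter values 0 and 1 predict "not taken"; 2 and 3 predict "taken".
    This is a textbook scheme, not a description of any specific Intel chip.
    """
    counter = 2  # arbitrary starting state: weakly "taken"
    wrong = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            wrong += 1
        # Nudge the counter toward what actually happened, staying in 0..3.
        if taken:
            counter = min(3, counter + 1)
        else:
            counter = max(0, counter - 1)
    return wrong


if __name__ == "__main__":
    steady = [True] * 1000                # always jumps the same way
    loop = ([True] * 9 + [False]) * 100   # a 10-iteration loop, run 100 times
    rng = random.Random(0)
    erratic = [rng.random() < 0.5 for _ in range(1000)]

    print("steady :", mispredictions(steady))   # 0
    print("loop   :", mispredictions(loop))     # 100, one per loop exit
    print("erratic:", mispredictions(erratic))  # roughly half of the 1000
```

    Each misprediction stands for one of the expensive roll-backs described above, so the erratic pattern costs hundreds of them where the steady pattern costs none.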

    An Example

    All of this may seem a little confusing. Hopefully an example will clear everything up. From the application's perspective, we are still running on the same instruction pipeline as the old 8086. There is a black box. The instruction pointed to by the instruction pointer is processed by the black box, and when it comes out the results are reflected in memory. From the instruction's point of view, however, that black box is quite a ride. Here is today's (circa 2008-2013) CPU ride, as seen by an instruction: First, you are a program instruction. Your program is being run. You are waiting patiently for the instruction pointer to point to you so you can be processed. When the instruction pointer gets about 4 kilobytes away from you -- about 1500 instructions away -- you get collected into the instruction cache. Loading into the cache takes some time, but you are far away from being run. This prefetch is part of the first pipeline stage. The instruction pointer gets closer and closer. When the instruction pointer gets about 24 instructions away, you and five neighbors get pulled into the instruction queue. This processor has four decoders. It has room for one complex instruction and up to three simple instructions. You happen to be a complex instruction and are decoded into four micro-ops. Decoding is a multi-step process. Part of the decode process involved a scan to see what data you need and if you are likely to cause a jump to somewhere new. The decoder detected a need for some additional data. Unknown to you, somewhere on the far side of the computer, your data starts getting loaded into the data cache. Your four micro-ops step up to the register alias table. You announce which memory address you read from (it happens to be fs:[eax+18h]) and the chip translates that into temporary addresses for your micro-ops. Your micro-ops enter the reorder buffer, or ROB. At the first available opportunity they move to the Reservation Station. 
The Reservation Station holds instructions that are ready to be run. Your third micro-op is immediately picked up by Execution Port 5. You don't know why it was selected first, but it is gone. A few cycles later your first micro-op rushes to Port 2, the Load Address execution unit. The remaining micro-ops wait as various ports collect other micro-ops. They wait as Port 2 loads data from the memory cache and puts it in temporary memory slots. They wait a long time... A very long time... Other instructions come and go while they wait for their micro-op friend to load the right data. Good thing this processor knows how to handle things out of order. Suddenly both of the remaining micro-ops are picked up by Execution Ports 0 and 1. The data load must be complete. The micro-ops are all run, and eventually the four micro-ops meet back in the Reservation Station. As they travel back through the gate the micro-ops hand in their tickets listing their temporary addresses. The micro-ops are collected and joined, and you, as an instruction, feel whole again. The CPU hands you your result, and gently directs you to the exit. There is a short line through a door marked "Retirement". You get in line, and discover you are standing next to the same instructions you came in with. You are even standing in the same order. It turns out this out-of-order core really knows its business. Each instruction then goes out of the CPU, seeming to exit one at a time, in the same order they were pointed to by the instruction pointer.
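    The ride above, where micro-ops finish in whatever order their data allows but leave through "Retirement" in program order, can be sketched in a few lines of Python. This is a toy model under loud assumptions: unlimited execution units, every micro-op visible to the scheduler from cycle 0, and made-up latencies. The names `simulate`, `load`, `add`, `inc`, and `mul` are hypothetical.

```python
def simulate(micro_ops):
    """Model out-of-order completion with in-order retirement.

    `micro_ops` is a list of (name, latency, deps) in program order, where
    `deps` names earlier micro-ops whose results are needed first.
    Returns (completion_order, retire_cycle).
    """
    finish = {}
    for name, latency, deps in micro_ops:
        # A micro-op starts as soon as all of its inputs are available.
        start = max((finish[d] for d in deps), default=0)
        finish[name] = start + latency

    # Order in which results actually become available (ties keep program order).
    completion_order = sorted(finish, key=lambda n: finish[n])

    # Retirement happens strictly in program order: nobody leaves before
    # the micro-op ahead of them in line has left.
    retire_cycle = {}
    previous = 0
    for name, _, _ in micro_ops:
        previous = max(previous, finish[name])
        retire_cycle[name] = previous
    return completion_order, retire_cycle


if __name__ == "__main__":
    program = [
        ("load", 20, []),       # slow: waits on the data cache
        ("add", 1, ["load"]),   # needs the loaded value
        ("inc", 1, []),         # independent, can run right away
        ("mul", 3, ["inc"]),    # depends only on inc
    ]
    completion, retire = simulate(program)
    print("completed:", completion)    # ['inc', 'mul', 'load', 'add']
    print("retired  :", list(retire))  # ['load', 'add', 'inc', 'mul']
    print("cycles   :", retire)
```

    The quick micro-ops finish long before the slow load, yet they still wait in line behind it at retirement, which is why the program never notices the reordering.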

    Conclusion

    This little lecture has hopefully shed some light on what happens inside a CPU. It isn't all magic, smoke, and mirrors. Getting back to the original questions, we now have some good answers.

    What goes on inside the CPU? There is a complex world where instructions are broken down into micro-operations, processed as soon as possible in any order, then put back together in order and in place. To an outsider it looks like they are being processed sequentially and independently. But now we know that on the inside they are handled out of order, sometimes even running branches of code based on a prediction that they will be useful.

    How long does it take for one instruction to run? While there was a good answer to this in the non-pipelined world, in today's processors the time it takes is based on what instructions are nearby, and the size and contents of the neighboring caches. There is a minimum amount of time it takes to go through the processor, but that is roughly constant. A good programmer and optimizing compiler can make many instructions run in around amortized zero time. With amortized zero time, the instruction itself is not what costs you; the cost is the time to work through the OOO core and the time spent waiting for cache memory to load and unload.

    What does it mean when a new CPU has a 12-stage pipeline, or 18-stage, or even a "deep" 31-stage pipeline? It means more instructions are invited to the party at once. A very deep pipeline can mean that several hundred instructions can be marked as 'in progress' at once. When everything is going well the OOO core is kept very busy and the processor gains an impressive throughput of instructions. Unfortunately, this also means that a pipeline stall moves from being a mild annoyance like it was in the early days, to becoming a performance nightmare as hundreds of instructions need to wait around for the pipeline to clear out.

    How can I apply this to my programs?
The good news is that CPUs can anticipate most common patterns, and compilers have been optimizing for the OOO core for nearly two decades. The CPU runs best when instructions and data are all in order. Always keep your code simple. Simple and straightforward code will help the compiler's optimizer identify and speed up the results. Don't jump around if you can help it, and if you need to jump, try to jump around exactly the same way every time. Complex designs like dynamic jump tables are fancy and can do a lot, but neither the compiler nor the CPU will predict what will be coming up, so complex code is very likely to result in stalls and mis-predictions. On the other side, keep your data simple. Keep your data in order, adjacent, and consecutive to prevent data stalls. Choosing the right data structures and data layouts can do much to encourage good performance. As long as you keep your code and data simple you can generally rely on your compiler's optimizer to do the rest. Thanks for joining me on the ride.

    Updates: 2013-05-17 Removed a word that someone was offended by.




    User Feedback


    There is still a case in game development where you have to deal with an in-order CPU: writing code for a PS3 or X360 (I know these aren't Intel CPUs). Just wanted to mention this. Otherwise a very good and very informative article.

    The Playstation's Cell processor might be a fun one to write up, although it makes the x86 processor look easy.

    The x86 uses a "superpipeline" as described above. The Cell family uses a "synergistic" approach involving nine mini-cpus.

    While it is true that each mini-cpu follows a mostly in-order pipeline, the mini-cpus have multiple parallel superscalar pipelines rather than a single pipeline.

    Among the mini-cpus there is one processor, the PPE, that does managerial work and operates as a command-and-control facility. There are 8 other processors, the SPEs, which are also fully-featured processors, and they do the work assigned by the PPE.

    Where the x86 can have at most six internal execution units running at once in addition to having instructions at various locations in the pipeline, the Cell can have over 50 execution units running simultaneously in addition to the array of deep pipelines on each mini-cpu. The combined pipelines of all the processors with their parallel pipelines can collectively hold the equivalent of about a thousand x86 micro-ops in their buffers at any time.

    If x86 is a house party where a hundred or so instructions are invited, the Cell is a city block party.

    The Cell processor has been dubbed a "supercomputer on a chip" because of the many interconnected microprocessors.

    ... And now I'm writing an article again. I'll stop now.

    Since the PS4 and XBox720 are due out soon, talking about the older Cell processors would really just be a historical exercise.


    Great article.  It finally taught me how speculative branching and branch prediction works!


    the i486 added superscalar pipelining.

    Superscalar means >1 instruction issue per cycle. So Pentium is superscalar, but 486 is just a pipelined scalar.

     

    For most users the CPU’s OOO core was effectively idle much of the time, even under load.

    It would be useless to have OoO if that were true. The execution units mainly idle in cases of branch misprediction, cache misses, lack of parallelism in the instruction stream, or a specific instruction mix (say, only integer instructions). A second thread can supply additional instructions to fill the available compute resources.

     

    To help give a steady flow of instructions to the OOO core they attached a second front-end.

    No SMT processor has its entire front end duplicated.

    Mainly queues, TLBs and tags here and there. Decoder is usually shared between threads and accessed in alternating cycles, or coarser granularity.

    In case of P4, both decoder and trace cache are shared.

    http://equipe.nce.ufrj.br/gabriel/arqcomp2/Hyperthread.pdf

    Interesting.

    For the first one, since the word bothers you I'll remove it.

    For the second point, I really wish that were true in my experience. When I work on optimizing our code bases, I always get discouraged when vtune shows just how much time the CPU is sitting idle. Even when going through tight loops the CPU is mostly just sitting there waiting for the cache. I'm glad to hear your experiences are different than what I see and measure in our game engines.

    On the third point, I figured that was just an implementation detail. Even you point out that it is different between the various chips. I didn't mention it, just like the details about returning back to the ROB before retirement (which in my view is much more important because the pipeline actually loops back on itself) and was included in early drafts, but ultimately left out for clarity and simplicity. Looks like you think it is important, so thanks for sharing!


    In 2006, Intel released the "Core" microarchitecture. For branding purposes, it was called "Core 2" (because everyone knows two is better than one).

     

    That's not right. The "Core Solo" and "Core Duo" followed the Pentium M and were succeeded by the "Core 2 Duo".


    For the first one, the chips included an internal CPU clock multiplier.

     

    The external clock is irrelevant, because the pipeline frequency is the one that matters. The 486 pipeline is still single-issue, hence scalar.

     

    Either way it is just a semantic argument about something that was superseded almost two decades ago.

     

    From the point of view of Intel engineers and the rest of the world, the Pentium was the first superscalar Intel processor.

    Your post is the only one in a brief Google search that states otherwise, and I don't see any point in that. Unless you started the original debate in the magazine you mentioned =)


    Awesome, I loved the part where we get the first-person point of view of an instruction. It really felt like being processed by a large institution designed to work through people's cases, with lots of agents working for you behind the counters and all... nice :)

    I just need to +1 Demonkoryu, because the Pentium M was super important in recent Intel history. They actually used the Pentium Pro design as a base and incorporated the modern, powerful speculative systems from the P4. So it was back to a shorter pipeline depth to regain raw performance, rather than the marketing-driven disaster that was the P4. Its deep pipeline was just an unfortunate necessity of the marketing department's push for higher clock frequencies.


    @Frob:  The PS4 is going to have an octo-core x86-64 based processor, probably less powerful than the Cell processor if I had to guess, but way more practical (and cheaper).  The thing with the Cell processor, at least in the PS3, was that actually writing code that can take advantage of such massive parallelism is incredibly difficult.  Heck, most common programs on a typical PC are only single-core, either because there is no need to parallelize or, even when there is, because it's incredibly difficult to write really good parallel code.  I don't know for certain, but I'd be comfortable making the bet that not a single mainstream game on the PS3 came anywhere near utilizing the proc's full capabilities... and I doubt any of them really needed to.

     

    C++11 has, theoretically, made multithreading rather easier, and the .NET framework does provide some pretty good, simple parallelism tools for simple cases (i.e., Parallel.ForEach and a fairly effective automatic work-stealing task architecture), but even then writing good parallel code for complex systems (like games) is still quite difficult.
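For readers who haven't touched the C++11 threading facilities, here is a minimal sketch of the "simple case" being described. The function name `parallel_sum` is made up for illustration; only `std::async`, `std::future`, and `std::accumulate` come from the standard library. It splits a vector in two and sums one half on another thread:

```cpp
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical helper: sum a vector by splitting it into two halves.
// One half is summed on a separate thread, the other on the calling thread.
long long parallel_sum(const std::vector<int>& values) {
    const std::size_t mid = values.size() / 2;
    auto first = values.begin();
    auto middle = first + static_cast<std::ptrdiff_t>(mid);

    // std::launch::async asks for the task to run on its own thread.
    std::future<long long> lower = std::async(std::launch::async, [first, middle] {
        return std::accumulate(first, middle, 0LL);
    });

    // Meanwhile the calling thread sums the second half.
    const long long upper = std::accumulate(middle, values.end(), 0LL);

    // get() blocks until the asynchronous half has finished.
    return lower.get() + upper;
}
```

This is the easy kind of parallelism: the two halves share no mutable state, so no locking is needed. Most game systems are not this tidy, which is the point being made above.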

     

    What I'd love to see, one of these years, is some sort of automatic multithreading on the part of compilers or operating systems, which would analyze the program code and automatically break up its instructions into discrete tasks and join them as appropriate, without requiring the developer to basically explode trying to juggle it all. (Elements of this are already possible, but it will likely be a long time before it is really mature for complex things, assuming progress marches in that particular direction.)

    Yes, it is a pain to work with the PS3. Several of my game credits were on cross-platform games that included the PS3. The Cell architecture is fun to play with, but hard to work with. It is far better than the PS2 was. The PS2's processors were fun for neither play nor work.

    As for the PS4, it isn't released yet so to say anything about it would likely be a violation of NDAs.


    What I'd love to see, one of these years, is some sort of automatic multithreading on the part of compilers or operating systems, which would analyze the program code and automatically break up its instructions into discrete tasks and join them as appropriate, without requiring the developer to basically explode trying to juggle it all. (Elements of this are already possible, but it will likely be a long time before it is really mature for complex things, assuming progress marches in that particular direction.)

     

    While it would be nice to see automatic multithreading, that particular advancement is one of the smallest gains you can get.  

     

    To do it right you need to completely understand your algorithm, partition the problem into the correct work units (which is a hard task), study the actual communication between tasks (which is also hard), agglomerate the work units into communication clusters, then map the data back to individual threads and CPUs. Unlike an automated facility for multithreading bits of obviously-parallel work, this PCAM process (partition, communicate, agglomerate, map) will usually transform the algorithm and give a significant speedup.

     

    I trust the compiler and optimizer to make pigeonhole optimizations to automatically parallelize and vectorize loops and other small blocks of my code. That doesn't bother me. I do not trust compilers and optimizers to completely transform my linear algorithm into a parallel-processing friendly format. That requires a level of understanding far beyond anything normally done in compilers.
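The mechanical half of that process can be sketched in a few lines of C++11. This is only an illustration under assumptions: `make_chunks` and `sum_of_squares` are hypothetical names, and the workload (a sum of squares) is deliberately one of the obviously-parallel cases. It agglomerates per-element work into contiguous chunks and maps each chunk to a thread:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Agglomerate n single-element work units into at most `workers` contiguous
// chunks, returned as half-open index ranges [begin, end).
std::vector<std::pair<std::size_t, std::size_t>>
make_chunks(std::size_t n, std::size_t workers) {
    std::vector<std::pair<std::size_t, std::size_t>> chunks;
    if (n == 0 || workers == 0) return chunks;
    workers = std::min(workers, n);
    const std::size_t base = n / workers;
    const std::size_t extra = n % workers;  // the first `extra` chunks get one more unit
    std::size_t begin = 0;
    for (std::size_t i = 0; i < workers; ++i) {
        const std::size_t end = begin + base + (i < extra ? 1 : 0);
        chunks.emplace_back(begin, end);
        begin = end;
    }
    return chunks;
}

// Map each chunk to its own thread. Every thread writes only to its own slot
// in `partial`, so the only communication is the final join and reduction.
long long sum_of_squares(const std::vector<int>& data, std::size_t workers) {
    const auto chunks = make_chunks(data.size(), workers);
    std::vector<long long> partial(chunks.size(), 0);
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < chunks.size(); ++i) {
        threads.emplace_back([&data, &partial, &chunks, i] {
            long long local = 0;
            for (std::size_t j = chunks[i].first; j < chunks[i].second; ++j) {
                local += static_cast<long long>(data[j]) * data[j];
            }
            partial[i] = local;
        });
    }
    for (auto& t : threads) t.join();

    long long total = 0;
    for (long long p : partial) total += p;
    return total;
}
```

Note what the sketch does not do: it never had to rethink the algorithm, because the work units were independent to begin with. Deciding how to partition a problem whose tasks do talk to each other is the part that still needs a human.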


    nifty...

     

    though this does make me wonder how the Intel and AMD CPUs are similar or differ on some of these points (I am personally mostly using AMD chips).


    The AMD pipeline is similar, but not identical.

     

    I found one write up on hothardware.com that provides an image similar to mine above, but focused on AMD hardware.  Their article has a different focus so they include more details on the cache (their purple, green, and violet areas), but otherwise you'll notice the model is similar.  

     

    It still follows the same pattern:  Instruction cache --> fetch --> decode  --> alias (they call it rename)  --> schedule --> process (they have 2 integer, 2 floating point, one load address, and one store address processor compared to the six above for Intel) --> retire.  They drew the ROB off to the side rather than as a step, but that is an implementation detail. The major difference is that AMD's floating point hardware is separate from the integer hardware, where Intel keeps them together, and that the usage of internal processors is slightly different (they both still have six). Exactly how the difference affects performance varies from program to program.

     
    BobcatDetail1.jpg


    mostly it is just that over the years I had done some optimizations based on experience, and I am left wondering if the performance behavior would turn out differently on other chips.

     

    mostly it is occasional code (typically within audio/video codecs) which does things like avoid integer<->floating point conversions (typically by using fixed-point arithmetic) and also sometimes avoiding use of unpredictable conditionals (by instead using arithmetic to figure out which results are used), ...

     

    granted, not that any of this is unusual (it seems that fixed-point is pretty much standard in A/V codecs anyways...).
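For anyone who hasn't seen these two tricks, a rough sketch in C++11 follows. Everything here is illustrative and not taken from any particular codec: the Q16.16 format and the names `fixed16`, `to_fixed`, `fixed_mul`, `select_u32`, and `clamp_to_255` are assumptions made for the example.

```cpp
#include <cstdint>

// Q16.16 fixed point: upper 16 bits are the integer part, lower 16 the fraction.
using fixed16 = std::int32_t;

// Convert a small whole number to Q16.16 (assumes it fits in 16 bits).
constexpr fixed16 to_fixed(int whole) {
    return static_cast<fixed16>(whole * 65536);
}

// Multiply two Q16.16 values: widen to 64 bits, multiply, then divide out the
// extra 16 fraction bits. No integer<->float conversion is involved.
constexpr fixed16 fixed_mul(fixed16 a, fixed16 b) {
    return static_cast<fixed16>((static_cast<std::int64_t>(a) * b) / 65536);
}

// Pick `a` when cond is true and `b` otherwise, using a bit mask instead of
// a conditional jump.
inline std::uint32_t select_u32(bool cond, std::uint32_t a, std::uint32_t b) {
    const std::uint32_t mask = 0u - static_cast<std::uint32_t>(cond);  // all ones or all zeros
    return (a & mask) | (b & ~mask);
}

// Saturate a value to the 0..255 range of an 8-bit sample without branching.
inline std::uint32_t clamp_to_255(std::uint32_t v) {
    return select_u32(v > 255u, 255u, v);
}
```

Whether the mask version beats a plain `if` depends on the compiler and the chip: optimizers often turn a simple conditional into a conditional move on their own, so it is worth measuring on both Intel and AMD hardware before committing to either form.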


    The Cell processor has been dubbed a "supercomputer on a chip" because of the many interconnected microprocessors.

    ... And now I'm writing an article again. I'll stop now.

    Since the PS4 and XBox720 are due out soon, talking about the older Cell processors would really just be a historical exercise.

    I'd gladly read that historical exercise :D

    Guest d33tah


    I just signed up here just to comment on this thing - it's awesome! Do I understand it correctly that the thing we perceive as a processor is pretty much just a "wrapper" chip for many more complicated, specialized chips that do logic, arithmetic, memory management and stuff? Anyway, I'll never look at the CPU the same way again now ;) When I first read the title of the article, the term "journey" kind of put me off as I got used to overrated articles using it to catch attention. This... really was a journey, especially the "example" section. I loved it. Thank you! :)


    This article is great!

    May I translate this article into Chinese and repost it on my blog?


