what good are cores?


Reminds me of the learning process I've gone through with GPU compute and LDS.

Since then I've really wished I had something like that level of control over the CPU cache.

But I've also noticed most people don't want this. They don't want to code close to the metal,

they just want the metal (or the compiler) to be clever enough to run their code efficiently.

Probably they're afraid of the additional work.

Personally I think the more control you have, the less work is necessary - less trial and error, guessing, hoping and profiling.

On the other hand, on the GPU the LDS size limit became a big influence on which algorithms I choose.

E.g. if it grew twice as large for a new generation of GPUs, I'd need to change huge amounts of code in drastic ways to get the best performance again.

So - in the long run - maybe those other people are right? Man should rule the machine - not the other way around?


Memory bandwidth is the bottleneck these days.


Bring on the triple channel! I was very upset when I learned that DDR3 implementations weren't supporting triple channel! I think it was only one or two Intel boards that would. Of course you could always build a system using server hardware.

I was far more disappointed when I read several articles about how "we don't need triple channel memory". Well yeah, no shit we can't make good use of triple channel if it isn't available to develop on, numb-nuts!


Quad channel on DDR4 shows next to no improvement, never mind triple channel.

Why does it show no improvement?

Let's talk about that, actually.

Can the OS not facilitate operations on multiple memory channels in parallel?
Does the software showing no improvement not make use of multiple channels?

The OS cannot see the multiple channels, in fact. More on this in a moment.

It does seem to me, though, that if you create a program that places blocks of data on each channel, it is trivial to utilize all four channels and achieve that maximum throughput.

How do you create blocks of data on each channel? I'll wait.

You have to remember, first and foremost, that any given program does not interact with the actual memory architecture of the system. Not ever. Let's work from the bottom up - a single stick of memory. No, wait, that's not the bottom. You have individual chips with internal rows and columns, themselves arranged into banks on the DIMM. Access times to memory are variable depending on access patterns within the stick!

But let's ignore the internals of a stick of RAM and naively call each of them a "channel". How do the channels appear to the OS kernel? Turns out they don't. The system memory controller assembles them into a flat address space ("physical" addressing) and gives that to the kernel to work with. Now a computer is not total chaos, and there is rhyme and reason to the mapping between physical address space and actual physical chips. Here's an example. There are no guarantees that this is consistent across any category of machines, of course. Also note that the mapping may not be zero based and please read the comments in that article regarding Ivy Bridge's handling of channel assignment.

Oh but wait, we're not actually interacting with any of that in development. All of our allocations happen in virtual address space. That mapping IS basically chaos. There's no ability to predict or control how that mapping will be set up. It's not even constant for any given address during the program's execution. You have no ability to gain any visibility into this mapping without a kernel mode driver or a side channel attack.

Just a reminder that most programmers don't allocate virtual memory blocks either. We generally use malloc, which is yet another layer removed.
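To make the layering concrete, here's a minimal sketch (Linux/POSIX assumed; none of this code comes from the thread itself): both the heap allocator and the kernel's page interface hand out virtual addresses, and nothing in user space can see which physical frames - let alone which DIMM or channel - end up backing them.

```cpp
// Layers of allocation, top to bottom. Virtual addresses only - the
// physical side is invisible from here. (Linux/POSIX assumed for mmap.)
#include <cstdio>
#include <cstdlib>
#include <sys/mman.h>

int main() {
    // The usual layer: the C runtime's heap allocator.
    void* a = std::malloc(4096);

    // One layer down: raw virtual pages straight from the kernel.
    void* b = mmap(nullptr, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    // Both are virtual addresses. The physical frames behind them are
    // assigned on first touch and can change later, invisibly to us.
    std::printf("malloc: %p\nmmap:   %p\n", a, b);

    munmap(b, 4096);
    std::free(a);
    return 0;
}
```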

The answer to "how do you create blocks of data on each channel" is, of course, that you don't. Even the OS doesn't, and in fact it's likely to choose an allocation scheme that actively discourages multi-channel memory access. Why? Because it has a gigantic virtual<->physical memory table to manage, and keeping that table simple means faster memory allocations and less kernel overhead in allocation. It's been a while since I dug into the internals of modern day kernel allocators, but if you can store mappings for entire ranges of pages it saves a lot of memory versus having disparate entries for each and every memory page. Large block allocations are also likely to be freed as blocks, making free list management easier. Long story short, the natural implementation of an allocator leans towards creating contiguous blocks of memory. How do you deal with that as a CPU/memory controller designer? Based on the link above, you simply alternate by cache line. Or, you scramble the physical address map to individual DRAM banks and chips. Remember that Ivy Bridge channel assignment bit? Yep, that's what happened.
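As a toy illustration of that per-cache-line interleave (the 64-byte line size, the four channels, and the plain modulo mapping are all assumptions for the sake of the example - real controllers scramble or hash these bits):

```cpp
// Hypothetical channel mapping: alternate channels every cache line.
#include <cstdint>
#include <cstdio>

constexpr uint64_t kCacheLine = 64; // assumed line size in bytes
constexpr uint64_t kChannels  = 4;  // assumed channel count

uint64_t channel_of(uint64_t phys) {
    return (phys / kCacheLine) % kChannels;
}

int main() {
    // A contiguous physical block still ends up spread over all channels.
    for (uint64_t addr = 0; addr < 8 * kCacheLine; addr += kCacheLine)
        std::printf("phys 0x%03llx -> channel %llu\n",
                    (unsigned long long)addr,
                    (unsigned long long)channel_of(addr));
    return 0;
}
```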

Frankly, the benefits of multi-channel memory probably show up almost exclusively in heavily multitasking situations that are heavy on memory bandwidth. I bet browsers love it :D

I think the benefits, or rather non-benefits, of multi-channel come from how the CPU and the RAM chips work. It did make some sense, at least in theory, a decade ago. But with DDR3 and 8x prefetch... It is, in my opinion, surprising that such a feature made it into the mainstream at all. Marketing, eh.

The basic idea behind multichannel is the same as with RAID-0. You need to read N blocks from disk, the disk just isn't fast enough, and access times are horrid. So, use two disks (or 3, 4, 5...) and stripe data over them. Now you can read twice (3x, 4x, 5x) as much data in the same time. Tadaaaa! Instant win for everybody.
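The striping arithmetic is just a modulo, something like this (stripe size and disk count are arbitrary example values):

```cpp
// RAID-0 style striping: which disk holds logical block N.
#include <cstdio>

int main() {
    const int blocks_per_stripe = 16; // example value
    const int num_disks         = 2;  // example value
    for (int block = 0; block < 96; block += blocks_per_stripe)
        std::printf("block %2d -> disk %d\n",
                    block, (block / blocks_per_stripe) % num_disks);
    return 0;
}
```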

Now, the idea is that you can do with cache lines what you do with disk sectors (the CPU works with memory in terms of complete cache lines). Even with virtual addresses translating to "random" physical addresses, on average it should still somehow work out: a 50% chance that any random physical RAM address is on the respective "not busy" stick. The CPU can already fetch another cache line into L1 while it's still busy fetching the last one. With several concurrent threads... Tadaaaa! Instant win.

Except with DDR3 (and DDR4) it's total bollocks, because you read 1024 bits per transfer (128 data pins times the 8x prefetch). Which means the smallest unit that the hardware is able to read from a single stick is already two cache lines' worth of contiguous physical memory. Unless the OS pays close attention to its VM mapping, it already throws away 50% every time. Doubling the number of channels only means you proportionally throw away more. Or, if you want to avoid doing that, you must sacrifice precious cache lines speculatively on data that may never be accessed...
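For the record, the arithmetic behind that claim, taking the post's ganged 128-bit dual-channel configuration at face value:

```cpp
// Smallest hardware read unit under the assumptions stated above.
#include <cstdio>

int main() {
    const int bus_width_bits = 128; // 2 x 64-bit channels, ganged (assumed)
    const int prefetch       = 8;   // DDR3 8n prefetch
    const int cache_line     = 64;  // bytes

    const int burst_bytes = bus_width_bits * prefetch / 8; // 128 bytes
    std::printf("burst = %d bytes = %d cache lines\n",
                burst_bytes, burst_bytes / cache_line);    // 2 cache lines
    return 0;
}
```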
By the way, what is a job system? Looking at the various links provided, it seems very close to the "task" construct in OpenMP 3 (unfortunately not supported by MSVC) or in the upcoming C++17 standard. On the other hand, tasks can create subtasks that can be picked up by other threads depending on the scheduler, whereas here it looks like all jobs are generated by the main thread while the other threads pick up jobs.

There's no standard definition, but in general it's a system that allows you to package up a function and some data (a job/task) and push it into a queue for execution, which is serviced by a thread pool. Usually you need some kind of job-dependency system as well.

In some of these systems, jobs cannot spawn jobs, and only the main thread creates them. In other systems, it's fine for jobs to create other jobs. In some, there's no such thing as a "main thread". Some are implemented using fibers (as well as threads), so that a job can yield part way through (instead of returning), and then resume when a dependency has completed.
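As a rough illustration, here's a minimal sketch of the simplest variant described above - one shared queue serviced by a thread pool, no dependency tracking, no fibers, no work stealing. All names are made up; real systems (see the links below) are considerably more sophisticated.

```cpp
// Minimal job system sketch: a thread pool draining one shared queue.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class JobSystem {
public:
    explicit JobSystem(unsigned workers = std::thread::hardware_concurrency()) {
        for (unsigned i = 0; i < workers; ++i)
            threads_.emplace_back([this] { WorkerLoop(); });
    }

    ~JobSystem() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            quitting_ = true;
        }
        cv_.notify_all();
        for (auto& t : threads_) t.join();
    }

    // Push a job; in this variant any thread (including a running job)
    // may submit more work.
    void Submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(job));
        }
        cv_.notify_one();
    }

private:
    void WorkerLoop() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return quitting_ || !queue_.empty(); });
                if (queue_.empty()) return; // quitting and fully drained
                job = std::move(queue_.front());
                queue_.pop();
            }
            job(); // run outside the lock so other workers stay busy
        }
    }

    std::vector<std::thread> threads_;
    std::queue<std::function<void()>> queue_;
    std::mutex mutex_;
    std::condition_variable cv_;
    bool quitting_ = false;
};
```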

I posted these two links on the first page that describe some real world ones:

https://blog.molecular-matters.com/2015/08/24/job-system-2-0-lock-free-work-stealing-part-1-basics/

http://fabiensanglard.net/doom3_bfg/threading.php

Quad channel on DDR4 shows next to no improvement, never mind triple channel.

Actually, quad channel makes sense on the HEDT platform. A mainstream i7 can have 4 cores, but HEDT goes up to 10 (and 22 in Xeons). So if it has 8 cores, it needs 4 memory channels to have the same bandwidth per core as a mainstream i7 using dual channel. A brand new 6950X (10 cores, HEDT) therefore has less bandwidth per core than a 6700K (4 cores, mainstream).
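Rough numbers to back that up, assuming DDR4-2400 (2400 MT/s x 8 bytes = 19.2 GB/s per channel; real achievable bandwidth will be lower):

```cpp
// Theoretical bandwidth per core for the two parts mentioned above.
#include <cstdio>

int main() {
    const double per_channel_gbs = 19.2; // DDR4-2400, an assumption
    struct Part { const char* name; int cores; int channels; };
    const Part parts[] = {
        {"6700K (mainstream)", 4, 2},  // dual channel
        {"6950X (HEDT)",      10, 4},  // quad channel
    };
    for (const Part& p : parts)
        std::printf("%-20s %.2f GB/s per core\n", p.name,
                    per_channel_gbs * p.channels / p.cores);
    // 6700K: 9.60 GB/s per core; 6950X: 7.68 GB/s per core
    return 0;
}
```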
