Getting the number of processors and cores?

Started by
15 comments, last by Jan Wassenberg 15 years, 3 months ago
I'm working on a fractal program and I'm going to make it multithreaded and multiprocessed. In order to maximize efficiency, I want to be able to automatically detect how many processors a computer has and how many cores each processor has. I looked at GetSystemInfo (for Windows), but the dwNumberOfProcessors member of SYSTEM_INFO returned 2 when I ran it on my machine (which is a dual core AMD Turion 64 with only one processor). I was hoping it would tell me the number of processors on my machine, but it appears it gives the number of cores. I also looked at boost::thread::hardware_concurrency(), but I want to get the number of processors and the number of cores separately, as 2 different numbers. Anyone know how to do this? I am, of course, working on Windows, but info for doing this on Mac & Linux would be great too.
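[Editor's note: as a portable baseline, the total logical-processor count (without the socket/core split asked for above) can be queried as in this minimal sketch, assuming a C++11 compiler; boost::thread::hardware_concurrency() behaves the same way.]

```cpp
#include <algorithm>
#include <thread>

// Portable count of logical processors (packages x cores x SMT threads,
// all lumped together). hardware_concurrency() may return 0 when the
// count is unknown, so fall back to 1.
// Separating sockets from cores needs a platform-specific API, e.g.
// GetLogicalProcessorInformation on Windows or /proc/cpuinfo on Linux.
unsigned logical_cpu_count()
{
    unsigned n = std::thread::hardware_concurrency();
    return std::max(n, 1u);
}
```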
[size=2][ I was ninja'd 71 times before I stopped counting a long time ago ] [ f.k.a. MikeTacular ] [ My Blog ] [ SWFer: Gaplessly looped MP3s in your Flash games ]
1) For the moment, more than one physical processor is the realm of servers and specialized workstations, and with processors trending toward 80-core designs, I don't see that changing soon. So GetSystemInfo is really all the info you should need.

2) Let the OS manage the threads. It has the best chance of getting things right across all systems you want your program to run on. Making your program "multi process" is going to be a lot more work than just spawning threads, and it isn't likely to gain you much, especially since it robs the OS of the knowledge that thread A1 A2 .. AN are related to B1 B2 .. BN.

3) You are most likely to see gains from proper threading. After you have it working, you can tune thread job sizes to get the most out of the cache with the minimal thrashing. Then you can tune how you interleave your data set into jobs to get some more gains.

4) Look into something like CUDA. Depending on how you do the fractal calculations, you may be able to take advantage of the massively parallel GPU to render the fractal faster than you can on a generic CPU.
Do you really need to distinguish between chips in discrete sockets and cores? What difference does it make to you if a machine has two processors in two sockets or two cores on the same chip?

I don't know of userland APIs to do this, but you can do it with the cpuid instruction on more recent Intel CPUs if you want: http://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration/.
The CPUID assembly instruction can get that information if the processor supports it. Here's an Intel page telling you how to use it, with code and a small test app: Clicky. Hope this helps [smile]

You didn't come into this world. You came out of it, like a wave from the ocean. You are not a stranger here. -Alan Watts

Quote:Original post by KulSeran
1) For the moment, more than one physical processor is the realm of servers and specialized workstations, and with processors trending toward 80-core designs, I don't see that changing soon. So GetSystemInfo is really all the info you should need.
I know it's overkill, but it's a challenge I'd really like to take on. I'd eventually like to test it out on a workstation at my university and see how it fares, just for kicks and giggles.

Quote:Original post by KulSeran
2) Let the OS manage the threads. It has the best chance of getting things right across all systems you want your program to run on.
I do plan on letting the OS manage the threads. I just want to know how many to create; I don't want to try to tell them which core to run on. Same with the multiple processors. I just want to create a certain number. The OS can decide where to run them.

Quote:Original post by KulSeran
Making your program "multi process" is going to be a lot more work than just spawning threads, and it isn't likely to gain you much, especially since it robs the OS of the knowledge that thread A1 A2 .. AN are related to B1 B2 .. BN.
Yes, it'll be more work, but it really should give me a huge benefit... unless I've misunderstood cores and processors. Let's say I have my program operating on 1 process and many worker threads that do my calculations (and assume the computer has multiple cores). If the computer has multiple processors, can the threads from one process execute on different processors, or are they limited to the process's processor's cores? If they can execute on multiple processors, then yes, creating multiple processes is overkill with no benefit, so I won't do it (let me know if this is true though; I don't know if it is. I've always thought it was the opposite.). But otherwise, it would be nice to try to access the power of those other processors, which would require multiple processes.

Quote:Original post by KulSeran
4) Look into something like CUDA. Depending on how you do the fractal calculations, you may be able to take advantage of the massively parallel GPU to render the fractal faster than you can on a generic CPU.
I've heard the term CUDA before, but I've never actually looked it up. Time to do some research. Thanks for the tip :)

Quote:Original post by outRider
Do you really need to distinguish between chips in discrete sockets and cores? What difference does it make to you if a machine has two processors in two sockets or two cores on the same chip?
This relates to my answer to 2) of KulSeran's post. Can two threads from one process be executed on separate CPUs? Or do they have to be executed on the same CPU? I always thought it was the latter, but if it's the former then that completely changes everything.
Windows treats all "processes" as threads. A "process" only denotes the address space a set of threads can use.
MSDN says that it attempts to schedule threads on the hardware that is closest to the memory being used. On a NUMA machine that means you need another process (another address space) for each processor die. On an SMP system the processor die is treated the same as another processor core.

I believe Linux does something similar.

Quote:Original post by KulSeran
Windows treats all "processes" as threads. A "process" only denotes the address space a set of threads can use.
MSDN says that it attempts to schedule threads on the hardware that is closest to the memory being used. On a NUMA machine that means you need another process (another address space) for each processor die. On an SMP system the processor die is treated the same as another processor core.

I believe Linux does something similar.


So in other words, the thread may or may not be able to execute on a different processor, depending on whether the system is SMP or NUMA, respectively? Does that mean I'd have to start new processes on a NUMA machine in order to utilize the other processors?
Quote:Original post by MikeTacular
So in other words, the thread may or may not be able to execute on a different processor, depending on whether or not its SMP or NUMA, respectively?

Not quite. That is how it ATTEMPTS to schedule threads.

The OS is free to schedule it however it wants.

Assuming four threads and four processors, any of these are allowed:
Processor:  0  1  2  3
----------------------
# Threads:  4  0  0  0
OR:         3  1  0  0
OR:         2  2  0  0
OR:         2  1  1  0
OR:         1  1  1  1


Since you are trying to maximize efficiency:
* assuming each thread has stuff to keep it busy equally,
* and assuming you have a parallel algorithm,
* and assuming your data is partitioned correctly,
* and assuming your data is distributed correctly,
* and assuming you have uniform communication between threads,
* and assuming there is no affinity for a particular processor
= then you should use the number from dwNumberOfProcessors and let the OS do the scheduling by itself.
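[Editor's note: a minimal sketch of that conclusion, using std::thread::hardware_concurrency() as a portable stand-in for dwNumberOfProcessors — spawn one worker per reported processor, set no affinity anywhere, and let the OS place them.]

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Spawn one worker per logical processor and let the OS schedule them.
// No affinity is set anywhere; the scheduler decides placement.
unsigned run_workers()
{
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 1;                       // count may be unreported

    std::atomic<unsigned> done{0};
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n; ++i)
        pool.emplace_back([&done] { done.fetch_add(1); });  // real work goes here

    for (std::thread& t : pool)
        t.join();
    return done.load();                      // == n once every worker has run
}
```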

Parallel processing requires you to develop parallel algorithms, design for parallel communication, and prepare for parallel data.

If you try to write parallel code without designing and developing your code properly, then the result may range from slow execution to horrible bugs that seem to spawn from the depths of hell.
Ok, cool, thanks frob! That was a great summary. The problem is indeed embarrassingly parallel, and I think I've got a pretty good parallel algorithm. I'll describe it below and let you (and everyone else) critique it.

When I render the fractal, I just want the color of each pixel. Each pixel in the fractal is independent of the other pixels. So I can create a 2D array of pixels representing the image. This 2D array will be created in the main thread and passed to the worker threads for them to fill, where each thread fills a portion of the array. When a worker thread is started, it is told to render a certain chunk of the image (and no two chunks overlap). The worker thread can then calculate what it needs to for each pixel in its assigned chunk, and after a pixel has been calculated, its color is assigned to the corresponding pixel in the 2D array. After a worker thread finishes calculating its chunk, it signals the main thread that it's finished, and then terminates. Once all worker threads have reported back as finished, we know the 2D array has been filled with the render.

Does that sound ok?
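[Editor's note: a rough sketch of the scheme described above — one shared 2D pixel array, each worker filling a disjoint band of rows, the main thread waiting via join. fake_pixel() is a placeholder standing in for the real per-pixel fractal calculation.]

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

using Pixel = unsigned;

// Placeholder for the real escape-time calculation.
static Pixel fake_pixel(std::size_t x, std::size_t y)
{
    return static_cast<Pixel>(x ^ y);
}

// Fill a w-by-h image (row-major) with nthreads workers, each owning a
// contiguous band of rows. Bands never overlap, so no locking is needed
// on the array itself; join() is the "signal finished" step.
void render(std::vector<Pixel>& image, std::size_t w, std::size_t h,
            unsigned nthreads)
{
    std::vector<std::thread> pool;
    std::size_t band = (h + nthreads - 1) / nthreads;   // rows per worker
    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t y0 = t * band;
        std::size_t y1 = std::min(h, y0 + band);
        pool.emplace_back([&image, w, y0, y1] {
            for (std::size_t y = y0; y < y1; ++y)
                for (std::size_t x = 0; x < w; ++x)
                    image[y * w + x] = fake_pixel(x, y);
        });
    }
    for (std::thread& t : pool)
        t.join();
}
```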
I'd break it up a BIT more than that.
Have your main program spawn N threads (one per core).
Each thread takes the form of:

while ( job j = jobs.get_next() )
{
j.run()
}

where jobs.get_next() is really a wrapper of the form Lock() PopNextJob() Unlock()

The main thread would then break the image into parts (for simplicity's sake, let's say scanlines of the image) and would enqueue into jobs a listing of the image array and the start and end line to render.

for ( int i = 0; i < imageHeight; i += jobsize )
{
jobs.add ( job ( i, i+jobsize, dataArray ) );
}

The reason you want to do this is twofold.
1) Not every scanline will take the same time. Look at the standard full-view plot of the Mandelbrot fractal: lines 0 and N will take significantly less time than line N/2. The scanlines at the top and bottom are mostly quickly-diverging, non-fractal pixels, while the scanline through the middle is mostly inside the fractal and will take forever to compute.

Imagine rendering a 4096x4096 plot of the Mandelbrot fractal. With 4 threads in this case and only 4 jobs (each 1/4 of the image), the two threads that get the top and bottom will finish far faster than the two running the middle, so you lose a lot of compute power. With 4 threads and 64 jobs, all 4 threads get a chance to work on both the easy parts and the hard parts, and they will all finish closer to the same time. With 4 threads and 4096 jobs, each job is a single scanline, and you will waste a ton of time locking to dequeue the next job.

So you need to be able to tune your job sizes and count.

2) Depending on the size of the data, the memory access pattern can matter. You may need to select jobs so they cover particular parts of the image at particular times in order to minimize cache thrashing. But this is very much secondary to my first point, as you are likely to be compute bound.
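[Editor's note: putting the pieces of this post together, here is a hedged sketch of the queue-based layout. Names like JobQueue, Job, and fake_pixel are illustrative, not a library API; get_next() is the Lock()/PopNextJob()/Unlock() wrapper described above, and jobsize is the tunable band height.]

```cpp
#include <algorithm>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// A job is a half-open band of rows [y0, y1) to render.
struct Job { std::size_t y0, y1; };

class JobQueue {
public:
    // Only called before the workers start, so no lock is needed here.
    void add(Job j) { jobs_.push_back(j); }

    // Lock() PopNextJob() Unlock(); returns false when the queue is empty.
    bool get_next(Job& out)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        if (jobs_.empty()) return false;
        out = jobs_.back();
        jobs_.pop_back();
        return true;
    }
private:
    std::mutex mutex_;
    std::vector<Job> jobs_;
};

// Placeholder for the per-pixel escape-time calculation.
static unsigned fake_pixel(std::size_t x, std::size_t y)
{
    return static_cast<unsigned>(x + y);
}

// Enqueue bands of jobsize rows, then let nthreads workers drain the queue.
void render_jobs(std::vector<unsigned>& image, std::size_t w, std::size_t h,
                 std::size_t jobsize, unsigned nthreads)
{
    JobQueue jobs;
    for (std::size_t i = 0; i < h; i += jobsize)
        jobs.add(Job{i, std::min(h, i + jobsize)});

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t)
        pool.emplace_back([&] {
            Job j;
            while (jobs.get_next(j))        // the while(job)/run loop above
                for (std::size_t y = j.y0; y < j.y1; ++y)
                    for (std::size_t x = 0; x < w; ++x)
                        image[y * w + x] = fake_pixel(x, y);
        });
    for (std::thread& t : pool)
        t.join();
}
```

Tuning jobsize trades load balance (many small jobs) against locking overhead (few large jobs), which is exactly the 4-vs-64-vs-4096 trade-off described above.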

