[.net] Can memory allocation be sped up with multiple threads?

Started by
13 comments, last by Fiddler 14 years ago
In our C# application we are trying to create large amounts of data, and a lot of this work is just constructing data objects. I tried splitting the work up into multiple threads with no shared data between them; the objects are stored on thread-local lists then combined in the main thread upon completion. But I saw barely any speed improvement on my dual-core machine. So I created a console application that just spawns two threads where each one just creates a bunch of data. Here is the thread function (not complete code):

      static void ThreadFunction ( object obj )
      {
         Data data = obj as Data;

         ProcessThread pt = GetProcessThread( );
         if( pt != null )
         {
            Console.WriteLine( "Got PT!" );
            pt.IdealProcessor = (data.Index % 2);
            pt.ProcessorAffinity = (IntPtr)3;
            pt.PriorityLevel = ThreadPriorityLevel.AboveNormal;
         }

         Signals[0].WaitOne( );
         
         double start = pt.TotalProcessorTime.TotalSeconds;

         int count = COUNT / NUM_THREADS;
         data.Str = new string [count];
         data.Doubles = new double [count];
         data.Dictionary = new Dictionary<string,object> [count];
         
         for( int s = 0; s < count; s ++ )
         {
            data.Str = s.ToString();
            data.Doubles = (double) s;
            data.Dictionary = new Dictionary&lt;string,object&gt;( );
            data.Dictionary.Add( data.Str, null );
         }
         
         double time = pt.TotalProcessorTime.TotalSeconds - start;
         
         Signals[1].Release();
         
         Console.WriteLine( "Thread #" + (data.Index+1) + " time = " + time );
      }
</pre>

I experimented with processor affinity and found the above method of specifying affinity and ideal processor to yield the best results, although the difference is minor. GetProcessThread is a function I implemented which returns the ProcessThread for the current thread so I can set those values. The signals are semaphores used to synchronize both the start and end of the main section.

Anyway, the CPU time for each thread is about half of the CPU time if I just create &#111;ne thread with twice the data, but the total execution time is almost the same, meaning the threads spend half of the main loop just waiting. Why? There are no locks. Is it possible I've hit the memory-bandwidth limit (seems highly unlikely) or is it a problem with .NET? Is there any solution?
~CGameProgrammer( );Developer Image Exchange -- New Features: Upload screenshots of your games (size is unlimited) and upload the game itself (up to 10MB). Free. No registration needed.
Advertisement
Memory allocation in the CLR is basically a pointer increment; it's not like C++. You will not speed up the allocations by performing them on multiple threads -- in fact you might slow it down due to any internal locking the allocator needs to perform. You can't make addition much faster, really.

If your constructors are very expensive, then you might see a benefit. But you're not allocating objects with constructors in these tests.

What is the scenario you are actually trying to optimize (in detail)? This benchmark isn't very practical.
We have a logistics application that shows a globe like Google Earth with icons, labels, symbols, etc. placed on the map. 75,000 icons take about five seconds to fully create and we are supposed to reduce that number. I create two threads and create half the objects on each to split up the workload. This resulted in a 13% speed improvement -- OK but underwhelming.

The constructors alone take almost two seconds even though they have no locks at all so that's why I created the test application - to see if multiple threads that allocate memory actually save time or not.

The thing is, even if I modify the thread function so memory allocate occurs before the "main section" (in-between the signals) I still get extremely poor parallelization. This is the new thread function:

      static void ThreadFunction ( object obj )      {         Data data = obj as Data;         ProcessThread pt = GetProcessThread( );         if( pt != null )         {            Console.WriteLine( "Got PT!" );            pt.IdealProcessor = (data.Index % 2);            pt.ProcessorAffinity = (IntPtr)3;            pt.PriorityLevel = ThreadPriorityLevel.AboveNormal;         }         int count = COUNT / NUM_THREADS;         data.Str = new string [count];         data.Doubles = new double [count];         data.Dictionary = new Dictionary<string,object> [count];         for( int s = 0; s < count; s ++ )            data.Dictionary = new Dictionary&lt;string,object&gt;( );<br>         <br>         Signals[0].WaitOne( );<br>         <br>         double start = pt.TotalProcessorTime.TotalSeconds;<br><br>         for( int s = 0; s &lt; count; s ++ )<br>         {<br>            data.Str = s.ToString();<br>            data.Doubles = (double) s;<br>         }<br>         <br>         double time = pt.TotalProcessorTime.TotalSeconds - start;<br>         <br>         Signals[1].Release();<br>         <br>         Console.WriteLine( "Thread #" + (data.Index+1) + " time = " + time );<br>      }<br></pre><br>All that happens in the timed section is I set the values of a bunch of strings and doubles. Here is what the main function looks like:<br><pre><br>      [MTAThread]<br>      static void Main (string[] args)<br>      {<br>         Thread [] threads = new Thread [NUM_THREADS];<br>         int count = COUNT / NUM_THREADS;<br><br>         for( int t = 0; t &lt; threads.Length; t ++ )<br>         {<br>            threads[t] = new Thread( ThreadFunction );<br>            threads[t].SetApartmentState( ApartmentState.MTA );<br>            Data d = new Data( );<br>            d.Index = t;<br>            threads[t].Start( d );<br>         }<br><br>         // should use a semaphore to wait for the threads to finish<br>         // allocating memory before the main loop but out of laziness<br>         // just put in a really long sleep:<br>         Thread.Sleep(5000);<br>         <br>         long start = Timing.Ticks;<br>         <br>         Signals[0].Release( NUM_THREADS );<br><br>         for( int t = 0; t &lt; threads.Length; t ++ )<br>            Signals[1].WaitOne();<br><br>         Console.WriteLine( "Finished in " + Timing.ElapsedSeconds(start) );<br><br>      }<br></pre><br>The 'Timing' calls are a static utility class that use Stopwatch. Anyway, with this version, two threads take 89% as long as &#111;ne thread - little speed improvement. And all I'm doing is assigning memory.<br><br>So what's the bottleneck here?
~CGameProgrammer( );Developer Image Exchange -- New Features: Upload screenshots of your games (size is unlimited) and upload the game itself (up to 10MB). Free. No registration needed.
Wait, you're creating 75k icons at once? Why? Are all 75k icons going to be rendered at once? If so I can't imagine your users will see much of ANYTHING. Perhaps you should delay loading icons until you need a particular set of icons.

Your biggest slowdown is probably just disk accesses. 75k small icons is a lot of seeking work for the HD. Allocations are about as fast as they can be, as jpetrie mentions: the .Net allocator essentially just moves a pointer forward. If you have the ability to adjust your information storage mechanisms, then you should probably just batch load the whole chunk into a singular array of data (allocated on the LOH), and then just use indexes and length offsets into said array.

Also, if you're going to use threads... then actually use threads, drop all that nonsense with waiting for signals and just do your work in the thread. If you need to time it, then time it appropriately, not by blocking on a signal. Also, profile. Don't guess.

In time the project grows, the ignorance of its devs it shows, with many a convoluted function, it plunges into deep compunction, the price of failure is high, Washu's mirth is nigh.

I find hard to believe you have 75,000 unique icons.
As mentioned you are IO bound.

The fastest way to get data off disk and into RAM is to bit-blast it into a packed array with one read.
Unsafe C#, 'fixed' length strings etc...
You can do that in one background thread so the app continues to start-up load otherwise.

What are going to do once you have a 1,000,000 PoI's?
Require Windows 64?

Seems odd to new 75,000 Dictionary's.
You could do the same job with 1 dictionary and a better hash.

I presume you do have a dual/quad core computer?
(Will your clients?)
- The trade-off between price and quality does not exist in Japan. Rather, the idea that high quality brings on cost reduction is widely accepted.-- Tajima & Matsubara
The icons do indeed need to all be on the map - they will be aggregated in most cases (but not all) but the data needs to be there in memory. I can render all 75k at 60 FPS though.

There is no disk access at all - when I say object creation, I mean just creating the icon objects, without loading the images, creating the meshes, etc. "obj = new Icon()" done 75k times is an oversimplification of what my application is doing but it's roughly accurate.

Semaphores were used in this sample code to get reliable time values. I wanted the two threads to do the same processing at the same time. In my actual application I use semaphores to ensure objects dependent on other objects are created last. But this isn't about that.

The test case I created is extremely simple. It does totally straightforward variable assignment and still takes 89% as long as single-threaded. I did some more experimenting with it and discovered something interesting.

Say I initialize the data structures like so:
int count = COUNT / NUM_THREADS;data.Str = new string [count*COUNT];data.Doubles = new double [count*COUNT];

Now if, in the timed section, I run this code, two threads take almost half the time of one thread, which is excellent:
for( int s = 0; s < count; s ++ )for( int t = 0; t < COUNT; t ++ ){   data.Str[t] = t.ToString();   data.Doubles[t] = (double)t;}

Specifically when COUNT is 3000 I get 1.45 seconds for one thread and 0.76 seconds for two.

However, if instead I set [s*t] instead of [t], meaning each array element is set just once instead of multiple times redundantly, then I get 6.5 seconds for one thread and 6.1 seconds for two threads. So the two threads took 94% of the time as one thread which is bad.

Strange.
~CGameProgrammer( );Developer Image Exchange -- New Features: Upload screenshots of your games (size is unlimited) and upload the game itself (up to 10MB). Free. No registration needed.
Quote:Original post by Shannon Barber
You do not have 75,000 unique icons.

They are managed individually. Any icon can be moved, resized, rotated, or changed to some other image, etc. Even if I did not create an object to represent each one, I'd still need to put all that individual-specific data somewhere - ID, location, image, rotation, opacity, whether or not it's visible, various aggregation data, and many other things. So obviously those should be placed into an object that represents an individual icon.
~CGameProgrammer( );Developer Image Exchange -- New Features: Upload screenshots of your games (size is unlimited) and upload the game itself (up to 10MB). Free. No registration needed.
Is the information and the icons itself in seperate files? See if you can put them all in a single file (one big icon file and one big data file) that is much faster than loading 75K individual files.

Perhaps you could try to optimize by using lazy loading (only load what you need when you need it)
I used the code below to measure allocation time (in release mode), and a typical output is:

w/struct wrapper----------------contents of a random Icon:        544030741 =     841578993        1728885252 =    1537408804        80803213 =      397631142        1026339228 =    1469532357        1641853513 =    1096758211        431056886 =     677753713        953717931 =     531235070        134192515 =     997216291        846327357 =     1435293272        157979954 =     1016132776w/struct wrapper:               00:00:01.3951300w/o struct wrapper------------------contents of a random Icon:        695847830 =     1373447411        408229506 =     515350655        1958031259 =    880850619        469875298 =     1515812807        2079726545 =    1140552211        185385462 =     96911858        1199035947 =    149279508        1976482201 =    799179110        1442314207 =    1733777608        1284052438 =    255298557w/o struct wrapper:             00:00:01.5275294w/wrapper slower:               Falsetest speed ratio:               0,913324483312727Hit enter...


Now, I don't know enough of your application to draw a conclusion; 75k objects with a non-trivial constructor is still pretty fast.

I immediately thought "Flyweight" and "lazy loading" as soon as I saw the requirement of 75k objects. Would that help?

Alloc test code:
using System;using System.Collections.Generic;using System.Linq;using System.Text;using System.Diagnostics;namespace AllocTest{    public struct Wrapper<T>        where T : class    {        public T Reference;            }    public class Icon    {        public string ImagePath { get; set; }        public int X { get; set; }        public int Y { get; set; }        public int Z { get; set; }        public bool Visible { get; set; }        public float Opacity { get; set; }        public IDictionary<string, string> Properties { get; set; }        public Icon()        {            this.ImagePath = string.Empty;            this.Properties = new Dictionary<string, string>();            Random random = new Random();                        for (int i = 0; i < 10; i++)            {                var key = random.Next();                var value = random.Next();                this.Properties.Add(key.ToString(), value.ToString());               }                 }    }    class Program    {        private const int Count = 75000;        static void Main(string[] args)        {            Console.WriteLine("w/struct wrapper");            Console.WriteLine("----------------");            Random peekIndex = new Random();            Stopwatch timer = Stopwatch.StartNew();            Wrapper<Icon>[] alloc = new Wrapper<Icon>[Count];            for (int i = 0; i < alloc.Length; i++)            {                alloc = new Wrapper<Icon>() { Reference = new Icon() };            }            timer.Stop();            var time1 = timer.Elapsed;            var index = peekIndex.Next(Count);            // just to make sure the array and its content are actually used            Console.WriteLine("contents of a random Icon:");                        var icon = alloc[index].Reference;            foreach (var prop in icon.Properties)            {                Console.WriteLine("\t{0} =\t{1}", prop.Key, prop.Value);            }            Console.WriteLine("\nw/struct wrapper:\t\t{0}\n", time1);                       timer.Reset();            Console.WriteLine("w/o struct wrapper");            Console.WriteLine("------------------");            timer.Start();            Icon[] alloc2 = new Icon[Count];            for (int i = 0; i < alloc2.Length; i++)            {                alloc2 = new Icon();            }            timer.Stop();            var time2 = timer.Elapsed;            index = peekIndex.Next(Count);            Console.WriteLine("contents of a random Icon:");            icon = alloc2[index];            foreach (var prop in icon.Properties)            {                Console.WriteLine("\t{0} =\t{1}", prop.Key, prop.Value);            }            Console.WriteLine("\nw/o struct wrapper:\t\t{0}\n", time2);            Console.WriteLine("\nw/wrapper slower:\t\t{0}\ntest speed ratio:\t\t{1}", time1 > time2, time1.TotalMilliseconds / time2.TotalMilliseconds);            Console.Write("Hit enter...");            Console.ReadLine();        }    }}
Rainweaver Framework (working title)

IronLua (looking for a DLR expert)



Regarding image loading, there are very few images that are ever loaded. If all 75k icons use the same image then there is only one image load from disk. But that's not what I've been profiling.

As for the struct wrapper, I've never heard of that. Can't imagine why it would speed things up but I'll test it for myself, then see if there's an efficient way to use it for the objects I'm creating. Thanks for the suggestion!

As for lazy loading, the library I'm creating cannot do that since it needs to immediately handle anything it's told to, but the applications that use it could always stream objects in gradually if they wanted to. But they don't. They want 75k icons to appear onscreen as quickly as possible.
~CGameProgrammer( );Developer Image Exchange -- New Features: Upload screenshots of your games (size is unlimited) and upload the game itself (up to 10MB). Free. No registration needed.

This topic is closed to new replies.

Advertisement