[.net] Can memory allocation be sped up with multiple threads?

In our C# application we are trying to create large amounts of data, and a lot of this work is just constructing data objects. I tried splitting the work up into multiple threads with no shared data between them; the objects are stored on thread-local lists, then combined in the main thread upon completion. But I saw barely any speed improvement on my dual-core machine.

So I created a console application that just spawns two threads, where each one just creates a bunch of data. Here is the thread function (not the complete code):

static void ThreadFunction( object obj )
{
    Data data = obj as Data;
    ProcessThread pt = GetProcessThread();
    if( pt != null )
    {
        Console.WriteLine( "Got PT!" );
        pt.IdealProcessor = (data.Index % 2);
        pt.ProcessorAffinity = (IntPtr)3;
        pt.PriorityLevel = ThreadPriorityLevel.AboveNormal;
    }

    Signals[0].WaitOne();
    double start = pt.TotalProcessorTime.TotalSeconds;

    int count = COUNT / NUM_THREADS;
    data.Str = new string[count];
    data.Doubles = new double[count];
    data.Dictionary = new Dictionary<string,object>[count];
    for( int s = 0; s < count; s++ )
    {
        data.Str[s] = s.ToString();
        data.Doubles[s] = (double)s;
        data.Dictionary[s] = new Dictionary<string,object>();
        data.Dictionary[s].Add( data.Str[s], null );
    }

    double time = pt.TotalProcessorTime.TotalSeconds - start;
    Signals[1].Release();
    Console.WriteLine( "Thread #" + (data.Index + 1) + " time = " + time );
}

I experimented with processor affinity and found the above way of specifying the affinity and ideal processor to yield the best results, although the difference is minor. GetProcessThread is a function I implemented which returns the ProcessThread for the current thread so I can set those values. The signals are semaphores used to synchronize both the start and end of the main section.

Anyway, the CPU time for each thread is about half of the CPU time when I just create one thread with twice the data, but the total execution time is almost the same, meaning the threads spend half of the main loop just waiting. Why? There are no locks. Is it possible I've hit the memory-bandwidth limit (seems highly unlikely), or is it a problem with .NET? Is there any solution?
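GetProcessThread isn't shown in the post. For completeness, a minimal sketch of one common way to write it - Windows-only (it P/Invokes kernel32), and it assumes the managed thread stays pinned to one OS thread, which the CLR does not strictly guarantee:

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

static class ThreadUtil
{
    [DllImport("kernel32.dll")]
    static extern uint GetCurrentThreadId();

    // Finds the OS-level ProcessThread backing the calling thread by
    // matching the native thread id against the process's thread list.
    public static ProcessThread GetProcessThread()
    {
        uint id = GetCurrentThreadId();
        foreach (ProcessThread pt in Process.GetCurrentProcess().Threads)
            if (pt.Id == id)
                return pt;
        return null; // not found (or the thread has already exited)
    }
}
```

This is only a sketch of the idea; the original implementation may differ.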
Memory allocation in the CLR is basically a pointer increment; it's not like C++. You will not speed up the allocations by performing them on multiple threads -- in fact you might slow it down due to any internal locking the allocator needs to perform. You can't make addition much faster, really.
If your constructors are very expensive, then you might see a benefit. But you're not allocating objects with constructors in these tests.
What is the scenario you are actually trying to optimize (in detail)? This benchmark isn't very practical.
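To illustrate the "expensive constructors" case: a sketch (with a made-up Expensive type whose constructor burns CPU) where slicing construction across threads does scale, because the cost is computation rather than allocation. The type and the slicing scheme are illustrative, not from the original application:

```csharp
using System;
using System.Threading;

class Expensive
{
    public readonly double Value;

    public Expensive(int seed)
    {
        // Simulated CPU-heavy constructor work; the allocation itself is cheap.
        double v = seed;
        for (int i = 0; i < 10000; i++)
            v = Math.Sqrt(v + i);
        Value = v;
    }
}

static class ParallelBuild
{
    // Each worker constructs a disjoint slice of the array: no shared
    // mutable state, no locks, real CPU work per constructor.
    public static Expensive[] Build(int count, int threadCount)
    {
        var items = new Expensive[count];
        var workers = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++)
        {
            int id = t; // capture a per-thread copy of the index
            workers[t] = new Thread(delegate()
            {
                for (int i = id; i < count; i += threadCount)
                    items[i] = new Expensive(i);
            });
            workers[t].Start();
        }
        foreach (Thread w in workers) w.Join();
        return items;
    }
}
```

With constructors this heavy, two threads on two cores should approach a 2x speedup; with empty constructors they will not.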
We have a logistics application that shows a globe like Google Earth with icons, labels, symbols, etc. placed on the map. 75,000 icons take about five seconds to fully create, and we are supposed to reduce that time. I created two threads and built half the objects on each to split up the workload. This resulted in a 13% speed improvement -- OK, but underwhelming.
The constructors alone take almost two seconds even though they have no locks at all, so that's why I created the test application: to see whether multiple threads that allocate memory actually save time or not.
The thing is, even if I modify the thread function so the memory allocation occurs before the "main section" (in between the signals), I still get extremely poor parallelization. This is the new thread function:
static void ThreadFunction( object obj )
{
    Data data = obj as Data;
    ProcessThread pt = GetProcessThread();
    if( pt != null )
    {
        Console.WriteLine( "Got PT!" );
        pt.IdealProcessor = (data.Index % 2);
        pt.ProcessorAffinity = (IntPtr)3;
        pt.PriorityLevel = ThreadPriorityLevel.AboveNormal;
    }

    int count = COUNT / NUM_THREADS;
    data.Str = new string[count];
    data.Doubles = new double[count];
    data.Dictionary = new Dictionary<string,object>[count];
    for( int s = 0; s < count; s++ )
        data.Dictionary[s] = new Dictionary<string,object>();

    Signals[0].WaitOne();

    double start = pt.TotalProcessorTime.TotalSeconds;

    for( int s = 0; s < count; s++ )
    {
        data.Str[s] = s.ToString();
        data.Doubles[s] = (double)s;
    }

    double time = pt.TotalProcessorTime.TotalSeconds - start;

    Signals[1].Release();

    Console.WriteLine( "Thread #" + (data.Index + 1) + " time = " + time );
}

All that happens in the timed section is I set the values of a bunch of strings and doubles. Here is what the main function looks like:

[MTAThread]
static void Main( string[] args )
{
    Thread[] threads = new Thread[NUM_THREADS];
    int count = COUNT / NUM_THREADS;

    for( int t = 0; t < threads.Length; t++ )
    {
        threads[t] = new Thread( ThreadFunction );
        threads[t].SetApartmentState( ApartmentState.MTA );
        Data d = new Data();
        d.Index = t;
        threads[t].Start( d );
    }

    // should use a semaphore to wait for the threads to finish
    // allocating memory before the main loop but out of laziness
    // just put in a really long sleep:
    Thread.Sleep( 5000 );

    long start = Timing.Ticks;

    Signals[0].Release( NUM_THREADS );

    for( int t = 0; t < threads.Length; t++ )
        Signals[1].WaitOne();

    Console.WriteLine( "Finished in " + Timing.ElapsedSeconds( start ) );
}

The 'Timing' calls go through a static utility class that uses Stopwatch.
Anyway, with this version, two threads take 89% as long as one thread - little speed improvement. And all I'm doing is assigning memory.

So what's the bottleneck here?
Wait, you're creating 75k icons at once? Why? Are all 75k icons going to be rendered at once? If so I can't imagine your users will see much of ANYTHING. Perhaps you should delay loading icons until you need a particular set of icons.
Your biggest slowdown is probably just disk accesses. 75k small icons is a lot of seeking work for the HD. Allocations are about as fast as they can be, as jpetrie mentions: the .Net allocator essentially just moves a pointer forward. If you have the ability to adjust your information storage mechanisms, then you should probably just batch load the whole chunk into a singular array of data (allocated on the LOH), and then just use indexes and length offsets into said array.
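The "single array plus offsets" idea might be sketched like this; the length-prefixed record format and the PackedStore name are assumptions for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical batched storage: every record lives in one big byte
// buffer (large arrays land on the Large Object Heap), and a record is
// just an (offset, length) pair into it -- no per-record allocation.
struct Slice
{
    public int Offset;
    public int Length;
}

class PackedStore
{
    readonly byte[] data;
    readonly List<Slice> index = new List<Slice>();

    public PackedStore(string path)
    {
        data = File.ReadAllBytes(path); // one read instead of thousands

        // Build the index; assumed format: [int32 length][payload] repeated.
        int pos = 0;
        while (pos + 4 <= data.Length)
        {
            int len = BitConverter.ToInt32(data, pos);
            pos += 4;
            index.Add(new Slice { Offset = pos, Length = len });
            pos += len;
        }
    }

    public int Count { get { return index.Count; } }

    public string GetString(int i)
    {
        Slice s = index[i];
        return Encoding.UTF8.GetString(data, s.Offset, s.Length);
    }
}
```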
Also, if you're going to use threads... then actually use threads, drop all that nonsense with waiting for signals and just do your work in the thread. If you need to time it, then time it appropriately, not by blocking on a signal. Also, profile. Don't guess.
I find it hard to believe you have 75,000 unique icons.
As mentioned, you are I/O bound.
The fastest way to get data off disk and into RAM is to bit-blast it into a packed array with one read.
Unsafe C#, 'fixed'-length strings, etc...
You can do that in one background thread so the rest of the app can continue its start-up loading.
What are you going to do once you have 1,000,000 PoIs?
Require 64-bit Windows?
Seems odd to new 75,000 Dictionaries.
You could do the same job with one dictionary and a better hash.
I presume you do have a dual/quad core computer?
(Will your clients?)
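The "one dictionary with a better hash" suggestion might look like the sketch below. The composite (icon id, property name) key and the property-bag use case are assumptions for illustration; the point is one well-hashed dictionary instead of 75,000 tiny ones:

```csharp
using System;
using System.Collections.Generic;

// Composite key for a single shared property table.
struct PropKey : IEquatable<PropKey>
{
    public readonly int IconId;
    public readonly string Name;

    public PropKey(int iconId, string name)
    {
        IconId = iconId;
        Name = name;
    }

    public bool Equals(PropKey other)
    {
        return IconId == other.IconId && Name == other.Name;
    }

    public override bool Equals(object obj)
    {
        return obj is PropKey && Equals((PropKey)obj);
    }

    public override int GetHashCode()
    {
        // Mix both fields so keys spread well across buckets.
        return IconId * 397 ^ (Name == null ? 0 : Name.GetHashCode());
    }
}

class PropertyTable
{
    readonly Dictionary<PropKey, object> table = new Dictionary<PropKey, object>();

    public void Set(int iconId, string name, object value)
    {
        table[new PropKey(iconId, name)] = value;
    }

    public bool TryGet(int iconId, string name, out object value)
    {
        return table.TryGetValue(new PropKey(iconId, name), out value);
    }
}
```

Implementing IEquatable and GetHashCode on the key struct avoids the boxing and reflection-based equality that ValueType would otherwise use.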
The icons do indeed need to all be on the map - they will be aggregated in most cases (but not all) but the data needs to be there in memory. I can render all 75k at 60 FPS though.
There is no disk access at all - when I say object creation, I mean just creating the icon objects, without loading the images, creating the meshes, etc. "obj = new Icon()" done 75k times is an oversimplification of what my application is doing but it's roughly accurate.
Semaphores were used in this sample code to get reliable time values. I wanted the two threads to do the same processing at the same time. In my actual application I use semaphores to ensure objects dependent on other objects are created last. But this isn't about that.
The test case I created is extremely simple. It does totally straightforward variable assignment and still takes 89% as long as single-threaded. I did some more experimenting with it and discovered something interesting.
Say I initialize the data structures like so:
int count = COUNT / NUM_THREADS;
data.Str = new string[count * COUNT];
data.Doubles = new double[count * COUNT];
Now if, in the timed section, I run this code, two threads take almost half the time of one thread, which is excellent:
for( int s = 0; s < count; s++ )
    for( int t = 0; t < COUNT; t++ )
    {
        data.Str[t] = t.ToString();
        data.Doubles[t] = (double)t;
    }
Specifically when COUNT is 3000 I get 1.45 seconds for one thread and 0.76 seconds for two.
However, if instead I write to [s*t] instead of [t], meaning each array element is set just once instead of multiple times redundantly, then I get 6.5 seconds for one thread and 6.1 seconds for two threads. So the two threads took 94% of the time of one thread, which is bad.
Strange.
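One possible reading of those numbers (an assumption, not a measurement of the program above): with [t], each thread rewrites the same COUNT elements over and over, so the writes stay in cache; with unique indices, the threads stream through the whole count*COUNT array and the two cores then compete for the same shared memory bus. A standalone sketch of the two access patterns:

```csharp
using System;
using System.Diagnostics;

static class BandwidthSketch
{
    // 64 MB of doubles -- far larger than the CPU caches of the era.
    static readonly double[] data = new double[8 * 1024 * 1024];

    // Pattern 1: re-touch a small, cache-resident region many times
    // (1024 * 8192 = ~8.4M writes, all hitting cache after the first pass).
    public static long SmallRegionManyPasses()
    {
        Stopwatch sw = Stopwatch.StartNew();
        for (int pass = 0; pass < 1024; pass++)
            for (int i = 0; i < 8192; i++)
                data[i] = i;
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    // Pattern 2: the same number of writes, but each element is touched
    // once, so the working set exceeds the cache and DRAM bandwidth dominates.
    public static long FullArrayOnePass()
    {
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < data.Length; i++)
            data[i] = i;
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    public static void Main()
    {
        Console.WriteLine("small region, many passes: " + SmallRegionManyPasses() + " ms");
        Console.WriteLine("full array, one pass:      " + FullArrayOnePass() + " ms");
    }
}
```

If the second pattern is much slower per write, memory bandwidth rather than the allocator is the likely bottleneck, and adding threads will not help.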
Quote: Original post by Shannon Barber
You do not have 75,000 unique icons.
They are managed individually. Any icon can be moved, resized, rotated, or changed to some other image, etc. Even if I did not create an object to represent each one, I'd still need to put all that individual-specific data somewhere - ID, location, image, rotation, opacity, whether or not it's visible, various aggregation data, and many other things. So obviously those should be placed into an object that represents an individual icon.
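For what it's worth, if the per-icon fields really are plain values like the ones listed, they could live in a single struct array, so 75k icons cost one allocation instead of 75,000. The concrete field types and the IconStore name are assumptions; the field list follows the post:

```csharp
using System;

// Per-icon state as a struct: ID, location, image, rotation, opacity,
// visibility (the fields named in the post). One array = one allocation.
struct IconData
{
    public int Id;
    public float X, Y;
    public int ImageIndex; // index into a shared image table, not a loaded image
    public float Rotation;
    public float Opacity;
    public bool Visible;
}

class IconStore
{
    public readonly IconData[] Icons;

    public IconStore(int count)
    {
        Icons = new IconData[count]; // single allocation, zero-initialized
        for (int i = 0; i < count; i++)
        {
            Icons[i].Id = i;
            Icons[i].Opacity = 1f;
            Icons[i].Visible = true;
        }
    }
}
```

This trades per-icon object identity for allocation speed, so it only fits if the aggregation data and "many other things" can also be expressed as values or indices.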
Are the information and the icons themselves in separate files? See if you can put them all in a single file (one big icon file and one big data file); that is much faster than loading 75K individual files.
Perhaps you could try to optimize by using lazy loading (only load what you need when you need it)
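A minimal sketch of the lazy-loading idea: the heavy work happens on first access instead of at construction. LoadImage here is a hypothetical stand-in for whatever the real loader does:

```csharp
using System;

// Lazy loading: constructing a LazyImage is nearly free; the expensive
// decode is deferred until something actually asks for the image.
class LazyImage
{
    readonly string path;
    object image; // decoded image, whatever type the renderer uses

    public LazyImage(string path)
    {
        this.path = path;
    }

    public object Get()
    {
        if (image == null)
            image = LoadImage(path); // first access pays the cost, once
        return image;
    }

    static object LoadImage(string path)
    {
        return "decoded:" + path; // placeholder for real decoding work
    }
}
```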
I used the code below to measure allocation time (in release mode), and a typical output is:

w/struct wrapper
----------------
contents of a random Icon:
    544030741 = 841578993
    1728885252 = 1537408804
    80803213 = 397631142
    1026339228 = 1469532357
    1641853513 = 1096758211
    431056886 = 677753713
    953717931 = 531235070
    134192515 = 997216291
    846327357 = 1435293272
    157979954 = 1016132776

w/struct wrapper: 00:00:01.3951300

w/o struct wrapper
------------------
contents of a random Icon:
    695847830 = 1373447411
    408229506 = 515350655
    1958031259 = 880850619
    469875298 = 1515812807
    2079726545 = 1140552211
    185385462 = 96911858
    1199035947 = 149279508
    1976482201 = 799179110
    1442314207 = 1733777608
    1284052438 = 255298557

w/o struct wrapper: 00:00:01.5275294

w/wrapper slower: False
test speed ratio: 0,913324483312727
Hit enter...

Now, I don't know enough of your application to draw a conclusion; 75k objects with a non-trivial constructor is still pretty fast.

I immediately thought "Flyweight" and "lazy loading" as soon as I saw the requirement of 75k objects. Would that help?

Alloc test code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;

namespace AllocTest
{
    public struct Wrapper<T> where T : class
    {
        public T Reference;
    }

    public class Icon
    {
        public string ImagePath { get; set; }
        public int X { get; set; }
        public int Y { get; set; }
        public int Z { get; set; }
        public bool Visible { get; set; }
        public float Opacity { get; set; }
        public IDictionary<string, string> Properties { get; set; }

        public Icon()
        {
            this.ImagePath = string.Empty;
            this.Properties = new Dictionary<string, string>();
            Random random = new Random();
            for (int i = 0; i < 10; i++)
            {
                var key = random.Next();
                var value = random.Next();
                this.Properties.Add(key.ToString(), value.ToString());
            }
        }
    }

    class Program
    {
        private const int Count = 75000;

        static void Main(string[] args)
        {
            Console.WriteLine("w/struct wrapper");
            Console.WriteLine("----------------");
            Random peekIndex = new Random();

            Stopwatch timer = Stopwatch.StartNew();
            Wrapper<Icon>[] alloc = new Wrapper<Icon>[Count];
            for (int i = 0; i < alloc.Length; i++)
            {
                alloc[i] = new Wrapper<Icon>() { Reference = new Icon() };
            }
            timer.Stop();
            var time1 = timer.Elapsed;

            var index = peekIndex.Next(Count); // just to make sure the array and its content are actually used
            Console.WriteLine("contents of a random Icon:");
            var icon = alloc[index].Reference;
            foreach (var prop in icon.Properties)
            {
                Console.WriteLine("\t{0} =\t{1}", prop.Key, prop.Value);
            }
            Console.WriteLine("\nw/struct wrapper:\t\t{0}\n", time1);

            timer.Reset();
            Console.WriteLine("w/o struct wrapper");
            Console.WriteLine("------------------");
            timer.Start();
            Icon[] alloc2 = new Icon[Count];
            for (int i = 0; i < alloc2.Length; i++)
            {
                alloc2[i] = new Icon();
            }
            timer.Stop();
            var time2 = timer.Elapsed;

            index = peekIndex.Next(Count);
            Console.WriteLine("contents of a random Icon:");
            icon = alloc2[index];
            foreach (var prop in icon.Properties)
            {
                Console.WriteLine("\t{0} =\t{1}", prop.Key, prop.Value);
            }
            Console.WriteLine("\nw/o struct wrapper:\t\t{0}\n", time2);

            Console.WriteLine("\nw/wrapper slower:\t\t{0}\ntest speed ratio:\t\t{1}", time1 > time2, time1.TotalMilliseconds / time2.TotalMilliseconds);
            Console.Write("Hit enter...");
            Console.ReadLine();
        }
    }
}
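The Flyweight idea mentioned above might be sketched like this: immutable shared state is interned once per distinct icon style, and each of the 75k instances keeps only its own transform plus a reference to the shared style. All names here are illustrative, not the real API:

```csharp
using System;
using System.Collections.Generic;

// Intrinsic (shared, immutable) state: one object per distinct style.
class IconStyle
{
    public readonly string ImagePath;

    public IconStyle(string imagePath)
    {
        ImagePath = imagePath;
    }
}

// Interns styles so identical requests return the same shared instance.
class IconStyleFactory
{
    readonly Dictionary<string, IconStyle> cache = new Dictionary<string, IconStyle>();

    public IconStyle Get(string imagePath)
    {
        IconStyle style;
        if (!cache.TryGetValue(imagePath, out style))
            cache[imagePath] = style = new IconStyle(imagePath);
        return style;
    }
}

// Extrinsic (per-instance) state stays small: just transform + a reference.
class IconInstance
{
    public IconStyle Style;
    public float X, Y, Rotation, Opacity;
}
```

With 75k instances but only a handful of distinct styles, the heavy state is constructed a handful of times instead of 75,000.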
Regarding image loading, there are very few images that are ever loaded. If all 75k icons use the same image then there is only one image load from disk. But that's not what I've been profiling.
As for the struct wrapper, I've never heard of that. Can't imagine why it would speed things up but I'll test it for myself, then see if there's an efficient way to use it for the objects I'm creating. Thanks for the suggestion!
As for lazy loading, the library I'm creating cannot do that since it needs to immediately handle anything it's told to, but the applications that use it could always stream objects in gradually if they wanted to. But they don't. They want 75k icons to appear onscreen as quickly as possible.