This topic is now archived and is closed to further replies.

need to optimize this code...badly

I am working on a program whose main function is to read in a large (possibly VERY large) array of floating point data and then create a bitmap from this data so that it can be displayed on the screen. Unfortunately, my code is SLOW. One of the files with the floating point values has over 16 million pieces of data; converting all of these to bytes that can be used to generate a bitmap is time consuming indeed. Anyway, here is what I'm currently doing:

1) Read the floating point data from file into rawData.
2) Calculate the max and min values in rawData.
3) Using max and min, determine the slope and y-intercept of a line that maps min to 0 and max to 255, and all the other points in between (i.e., byteData[ i ] = rawData[ i ] * slope + b).
4) Using a loop, apply the above formula to every single piece of data in rawData to generate the corresponding byteData equivalent.
5) Create a bitmap using byteData so that the user can view a graphical representation of the original rawData.

One thing to keep in mind is that I have no control over the format in which the original data is stored (it is always in float format). Also, in my loop, rawData[ i ], slope, and b are all floating point, which is very computationally expensive. I have used a Linux version of a similar program, and it can load and display the same 16-million-point file practically instantaneously. My program currently takes on the order of 2 or 3 minutes on my 1.8 GHz machine. My program is fine as far as what it does, but horrible in terms of speed and efficiency. Any suggestions on ways to speed this up are appreciated.

[edited by - StonieJ on June 8, 2004 8:15:29 AM]
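In outline, steps 2-4 above come down to something like the following sketch (hypothetical names, not the OP's actual code):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of the min/max scan plus linear remap described in steps 2-4.
// toBytes is a hypothetical name; byte values are truncated toward zero.
std::vector<unsigned char> toBytes(const std::vector<float>& rawData)
{
    // Step 2: find min and max.
    float lo = rawData[0], hi = rawData[0];
    for (std::size_t i = 1; i < rawData.size(); ++i)
    {
        if (rawData[i] < lo) lo = rawData[i];
        if (rawData[i] > hi) hi = rawData[i];
    }

    // Step 3: line mapping lo -> 0 and hi -> 255.
    float slope = (hi > lo) ? 255.0f / (hi - lo) : 0.0f;
    float b = -lo * slope;

    // Step 4: byteData[i] = rawData[i] * slope + b for every element.
    std::vector<unsigned char> byteData(rawData.size());
    for (std::size_t i = 0; i < rawData.size(); ++i)
        byteData[i] = (unsigned char)(rawData[i] * slope + b);
    return byteData;
}
```

The (hi > lo) guard avoids a division by zero when all input values are equal.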

Well, the first thing is to profile your code to find out where the bottlenecks are.

Is your image 256 pixels high and 16 million pixels wide?

Well anyway:

- do you use the full optimization options?

If it takes that much time, I doubt it. 180 seconds at 1.8 billion cycles per second is about 324 billion cycles for 16 million iterations, or roughly 20,000 cycles per iteration. It should be at least 20 times faster.

- make sure the compiler does not call ftoi(), which is one of the most renowned perf killers on any compiler. Normally the optimizer replaces it with much faster code: the big-number trick or a single asm instruction.
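For reference, one common form of the big-number trick looks like this. It is a sketch that assumes 32-bit IEEE-754 floats and the default round-to-nearest mode, and note that it rounds to the nearest integer rather than truncating:

```cpp
#include <cstring>

// "Big number trick": adding 2^23 forces the integer part of f into
// the low mantissa bits, avoiding a slow ftoi-style conversion.
// Valid for f in roughly [0, 2^22); rounds to nearest, not toward zero.
inline int fastFloatToInt(float f)
{
    float biased = f + 8388608.0f;             // 8388608 = 2^23
    unsigned int bits;
    std::memcpy(&bits, &biased, sizeof bits);  // type-pun via memcpy
    return (int)(bits & 0x7FFFFF);             // low 23 bits hold round(f)
}
```

On period compilers this avoided the expensive rounding-mode change that the library ftoi performed on every call.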

quote:
rawData[ i ], slope, and b are all floating point format, which is very computationally expensive.

Actually, this is very fast on any machine built since 1995; you have no better option except using SIMD.

Your potential performance leaks, sorted by importance:
- ftoi
- debug info, killing optimizations
- no loop unrolling
- cache misses (which I doubt)

Next, what you can do is unroll by 4 and accumulate four bytes into a 32-bit int with ORs and shifts, then write 4 bytes at once. Storing bytes one by one is a known perf killer too.
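A sketch of that unroll-and-pack idea (names hypothetical; the packing here puts in[0] in the lowest byte of the word):

```cpp
#include <cstring>

// Convert four floats with the slope/intercept mapping, pack the four
// result bytes into one 32-bit word with ORs and shifts, and store the
// word with a single 4-byte write instead of four 1-byte stores.
void convert4(const float* in, float slope, float b, unsigned char* out)
{
    unsigned int packed =
          (unsigned int)(unsigned char)(in[0] * slope + b)
        | (unsigned int)(unsigned char)(in[1] * slope + b) << 8
        | (unsigned int)(unsigned char)(in[2] * slope + b) << 16
        | (unsigned int)(unsigned char)(in[3] * slope + b) << 24;
    std::memcpy(out, &packed, 4);   // one 4-byte store
}
```

The in-memory byte order of the stored word follows the machine's endianness, so on a big-endian target the shift amounts would need to be reversed.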

Where does the profiler say it is spending most of its time?

Hey all, thanks for the replies. Give me about an hour or so and I'll get back with you on the questions. Need to get somewhere right quick though.

Do you really need to calculate the full array at once?
Since you display a bitmap, you may calculate only the part that is displayed. At a resolution of 800*600, that is only 480,000 pieces of data instead of 16 million.

Do you really need to calculate the min and max for the full map?
- Yes, if each datum is mapped to a color that represents its level against an absolute scale (like a geographical map with color indicating the sea level). (1)
- No, if each datum is only coded to give a relative indication of levels. (2)

(1)- You calculate the min and max for the full map, and each time you render a bitmap, you convert only the chunk you see.
float* DataArray //full array size
uchar* DispArray //full array size
GetMinMax(DataArray)
Conv2Byte(DispArray, Starting Coord, Width, Height, DataArray)
RenderBitmap(DispArray, Starting Coord, Width, Height)

(2)- You calculate the min and max for the local area, and each time you render a bitmap you convert on the fly.
float* DataArray //full array size
uchar* DispArray //bitmap size only
GetMinMax(Starting Coord, Width, Height, DataArray)
Conv2Bitmap(DispArray, Starting Coord, Width, Height, DataArray)
RenderBitmap(DispArray)
If the area is scrollable, there are just a few more things to do, like checking whether the newly scrolled area is still within the min and max values. If min and max are redefined, convert the full bitmap area again.
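A minimal sketch of option (2), with hypothetical names (convertWindow stands in for the GetMinMax + Conv2Bitmap pair, and fullWidth is the row stride of the full data array):

```cpp
#include <cstddef>
#include <vector>

// Option (2): scan min/max over the visible window only, then convert
// just that window to display bytes.  x0, y0 are the scroll position,
// w, h the viewport size; disp is sized to the window, not the full map.
void convertWindow(const float* data, int fullWidth,
                   int x0, int y0, int w, int h,
                   std::vector<unsigned char>& disp)
{
    // Local min/max over the visible rectangle.
    float lo = data[y0 * fullWidth + x0], hi = lo;
    for (int y = y0; y < y0 + h; ++y)
        for (int x = x0; x < x0 + w; ++x)
        {
            float v = data[y * fullWidth + x];
            if (v < lo) lo = v;
            if (v > hi) hi = v;
        }

    // Convert only the window, on the fly.
    float slope = (hi > lo) ? 255.0f / (hi - lo) : 0.0f;
    disp.resize((std::size_t)w * h);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            disp[y * w + x] = (unsigned char)
                ((data[(y0 + y) * fullWidth + x0 + x] - lo) * slope);
}
```

For an 800*600 viewport this touches 480,000 values per redraw instead of 16 million.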

Hope that helps.


Ghostly yours,
Red.

[edited by - Red Ghost on June 8, 2004 8:52:29 AM]

I threw this together in a few minutes while waiting on a long build, to simulate the described problem. It's barely optimized at all, and the transformation is quite quick (< 2 seconds on a 1.7GHz Pentium 3). In this case the output is:

Time to generate data 64 seconds
Time to read data 48 seconds
Time to calc min/max of data 0 seconds
Time to convert data to bytes 1 seconds

In this code the bottleneck is definitely the I/O. I strongly suspect that the OP's problem is probably I/O as well.



#include <algorithm>
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <cstdlib>
#include <ctime>
#include <unistd.h>   // unlink()

void GenerateData( std::string filename, int numberOfFloats )
{
    unlink( filename.c_str() );

    std::ofstream output( filename.c_str() );

    // First number in the file is the number of floats
    output << numberOfFloats << "\n";

    // Write random numbers to the file
    for ( int i = 0; i < numberOfFloats; i++ )
    {
        float value = rand() / (float)RAND_MAX;
        output << value << "\n";
    }

    output.close();
}

std::vector<float>* ReadData( std::string filename )
{
    std::vector<float>* data = new std::vector<float>();

    std::ifstream input( filename.c_str() );
    int numberOfFloats = 0;

    input >> numberOfFloats;

    data->reserve( numberOfFloats );

    float value = 0.0F;
    for ( int i = 0; i < numberOfFloats; i++ )
    {
        input >> value;
        data->push_back( value );
    }

    return data;
}

int main()
{
    typedef unsigned char byte;

    // Generate
    time_t timeStart = time( 0 );
    GenerateData( "Data.dat", 16000000 );
    std::cout << "Time to generate data " << (time(0) - timeStart) << " seconds\n";

    // Read
    timeStart = time( 0 );
    std::vector<float>* floatData = ReadData( "Data.dat" );
    std::cout << "Time to read data " << (time(0) - timeStart) << " seconds\n";

    // Min/Max
    timeStart = time( 0 );
    float dataMin = *(std::min_element( floatData->begin(), floatData->end() ));
    float dataMax = *(std::max_element( floatData->begin(), floatData->end() ));
    std::cout << "Time to calc min/max of data " << (time(0) - timeStart) << " seconds\n";

    // Transform floats to bytes
    timeStart = time( 0 );
    std::vector<byte> byteData;
    byteData.reserve( floatData->size() );
    for ( std::vector<float>::iterator it = floatData->begin();
          it != floatData->end();
          ++it )
    {
        byte value = (byte)(255.F * (*it - dataMin) / dataMax);
        byteData.push_back( value );
    }
    std::cout << "Time to convert data to bytes " << (time(0) - timeStart) << " seconds\n";

    delete floatData;
    return 0;
}

Has the data been saved in binary format? I'm not sure whether it would make any difference to performance, but try it out.

Thanks guys, I appreciate all the feedback. Now to answer some questions.

@ Charles B:

I had the program optimization set to Full Debug, i.e., no optimizations are used. However, even after rebuilding the program in Release mode (i.e., optimized for speed), I see no change in performance.


@ Red Ghost:

I think that only doing calculations for the currently viewable area is the way to go. Originally, the program did exactly that; however, it was too slow for some reason. For example, every time the user scrolled the image on the screen, the new coordinates would be sent to a handler function that would convert the appropriate area of float data to byte data and then generate the bitmap from that. This took too long, however, and the result was very laggy. It took at least a full second to convert just a 500-by-500 area or so of float data to byte data. Therefore, I decided to calculate all the byte data and create the entire bitmap offscreen beforehand; then I could just blit this offscreen bitmap to the screen at whatever coordinates were necessary.


@ MauMan

That's a pretty impressive float->byte conversion rate. For some reason it is taking me 0.8 seconds just to convert 500 x 500. Let me try using your way of converting to decrease the time and see what happens.

Guest Anonymous Poster
About MauMan's code, in case you are doing something similar: it is quicker to read the entire file at once than to read one float at a time. Also, you gain no advantage from using a vector instead of an array, but that doesn't seem like a problem since calculation isn't the bottleneck. I would do something like this:


int numFloats = 0;
fread(&numFloats, sizeof(int), 1, file);

float* rawData = new float[numFloats];
unsigned char* byteData = new unsigned char[numFloats];
fread(rawData, sizeof(float), numFloats, file);

for (int i = 0; i < numFloats; ++i)
    byteData[i] = calculate(rawData[i]);

delete[] rawData;


I hope this comes out all right; there's no preview button.

Yeah, I'm reading all of the float data at once, with one call in fact.

FileStream->Read(floatData, dataSize);

That step only takes about .5 seconds. The good news is that I've discovered that calculating the max and min values and the corresponding byte data for only the viewable image has greatly increased the speed. However, having to do this every time the image is scrolled, even if the image is moved only slightly, still causes too much lag (this is the technique used originally). I've basically come full circle and realized that I've accomplished nothing.


What is the practicality of this:

At the beginning of the program, find the max/min and byte data for only the viewable image and display it. Then, depending on the size of the image, create a number of threads that each work on a different section, and at the end, piece all the data gathered by the threads back together. That is, if the image is divided into quarters, 4 threads will work on finding the max and min values of their own sections, as well as converting the float data to byte data for those sections.


Just out of curiosity, how do programs like Photoshop and the like make this happen so fast?

quote:
Original post by StonieJ
Yeah, I'm reading all of the float data at once, with one call in fact.

FileStream->Read(floatData, dataSize);

That step only takes about .5 seconds. The good news is that I've discovered that calculating the max and min values and the corresponding byte data for only the viewable image has greatly increased the speed. However, having to do this every time the image is scrolled, even if the image is moved only slightly, still causes too much lag (this is the technique used originally). I've basically come full circle and realized that I've accomplished nothing.


What is the practicality of this:

At the beginning of the program, find the max/min and byte data for only the viewable image and display it. Then, depending on the size of the image, create a number of threads that each work on a different section, and at the end, piece all the data gathered by the threads back together. That is, if the image is divided into quarters, 4 threads will work on finding the max and min values of their own sections, as well as converting the float data to byte data for those sections.


Just out of curiosity, how do programs like Photoshop and the like make this happen so fast?




They do it exactly as you suggest, I believe. If you are working on a large image in Photoshop and you move the view a great distance, you can see the blocks being built up (as layers are collapsed to a viewable area). It works on the data in chunks. What you could do is handle the visible area and then start reading the rest of the data in a separate thread, working outward from the visible area.

After profiling your code, if you find the casts are taking too long, look into some of the fastcast methods available for your architecture. They are all over the net and do indeed speed up the conversion process.

I also wouldn't personally use multiple threads for this unless you can take advantage of multiple processors. Having multiple threads chew through the data in parallel will simply give you cache problems and take as long as, if not longer than, doing the chunks serially.


Out of curiosity, what's the application for?


[edited by - Sphet on June 8, 2004 12:57:28 PM]

Well, this thread issue brings up a couple more questions. First of all, I always thought the point of using threads was to use more than one. Can a single extra thread still speed up a process? In my case, the user would only be able to view the initial image until the thread loaded the rest, correct?

You mentioned a Linux program that accomplishes the same task speedily. Is it open source? If so, be sure to see if you can find the source. It might take a while to familiarize yourself with it, especially on larger projects, but I've always found open code quite readable. Abide by the license, of course, and don't steal the code, but there is nothing to say you can't gain inspiration from it.

Other than that, I can't think of anything much without seeing the code, I'm afraid.

Hi.

I am just guessing that you are using divisions in your calculation. If so, replace them all with their multiplication counterparts. I am taking MauMan's code as an example:


quote:
Original post by MauMan
[...]
// Write random numbers to the file
for ( int i = 0; i < numberOfFloats; i++ )
{
float value = rand() / (float)RAND_MAX;
output << value << "\n";
}
[...]


This would be much faster done like this:


float INV_RAND_MAX = 1.0f / (float)RAND_MAX;
for ( int i = 0; i < numberOfFloats; i++ )
{
float value = rand() * INV_RAND_MAX;
output << value << "\n";
}



or later in his code:

quote:

[...]
float dataMax = *(std::max_element( floatData->begin(), floatData->end() ));
[...]
for ( std::vector<float>::iterator it = floatData->begin();
it != floatData->end();
++it )
{
byte value = (byte)(255.F * (*it - dataMin) / dataMax);
byteData.push_back( value );
}
[...]




which would be much faster like this:


[...]
float inv_dataMax = 1.0f / (*(std::max_element( floatData->begin(), floatData->end() )));
[...]
for ( std::vector<float>::iterator it = floatData->begin();
it != floatData->end();
++it )
{
byte value = (byte)(255.F * (*it - dataMin) * inv_dataMax);
byteData.push_back( value );
}
[...]


Give it a try and you will see a massive speedup.

I just wrote a very simple test where I generated a file with 17,280,000 floats (which gives a nice 2400*2400 image at 3 values per pixel).


#include "stdafx.h"
#include "stdlib.h"
#include "math.h"
#include "stdio.h"
#include "assert.h"
#include "memory.h"
#include "time.h"
#include <iostream>

const int maxnum = 17280000;
const bool genfile = true;
const bool validate = true;

int main(int argc, char* argv[])
{
    // Generating file
    if (genfile)
    {
        time_t genstarttime = time(0);
        float* buf = new float[maxnum];

        for (int i=0; i<maxnum; i++)
        {
            buf[i] = (float)rand() / (float)RAND_MAX;
        }

        FILE* f = fopen("c:\\testflt.raw","wb+");
        fwrite(buf, sizeof(float), maxnum, f);
        fclose(f);
        delete[] buf;
        time_t genendtime = time(0);
        std::cout << "Time to generate data " << (genendtime - genstarttime) << " seconds\n";
    }

    // Reading data from file
    time_t readstarttime = time(0);

    FILE* f = fopen("c:\\testflt.raw","rb");
    float* buf = new float[maxnum];
    fread(buf, sizeof(float), maxnum, f);
    fclose(f);

    time_t readendtime = time(0);
    std::cout << "Time to read data " << (readendtime - readstarttime) << " seconds\n";

    // Determining min/max
    time_t maxstarttime = time(0);
    float maxf, minf;

    maxf = minf = buf[0];

    for (int i=1; i<maxnum; i++)
    {
        float cmp = buf[i];
        if (cmp < minf) minf = cmp;
        if (cmp > maxf) maxf = cmp;
    }

    time_t maxendtime = time(0);
    std::cout << "Time to determine min/max " << (maxendtime - maxstarttime) << " seconds\n";

    // Converting the data
    time_t convertstarttime = time(0);

    typedef unsigned char uchar;

    uchar* cbuf = new uchar[maxnum];
    memset(cbuf, 0, maxnum);

    float diff = maxf - minf;
    float delta = diff / 255.f;

    // Inject known values to sanity-check the mapping
    buf[0] = 1.f;
    buf[1] = 0.5f;
    buf[2] = 0.f;

    for (int j=0; j<maxnum; j++)
    {
        if (validate)
        {
            int test = (int)( (buf[j]-minf)/delta );
            assert(test <= 255);
            assert(test >= 0);
        }

        cbuf[j] = (uchar)( (buf[j]-minf)/delta );
    }

    time_t convertendtime = time(0);
    std::cout << "Time to convert data " << (convertendtime - convertstarttime) << " seconds\n";

    // Writing the data
    time_t writestarttime = time(0);

    f = fopen("c:\\testimg.raw","wb+");
    fwrite(cbuf, 1, maxnum, f);
    fclose(f);

    time_t writeendtime = time(0);
    std::cout << "Time to write data " << (writeendtime - writestarttime) << " seconds\n";

    delete[] buf;
    delete[] cbuf;

    // Wait for input to close
    char c;
    std::cin >> c;

    return 0;
}



On my system, in Release mode, it gives me the following output:

Time to generate data 2 seconds
Time to read data 0 seconds
Time to determine min/max 0 seconds
Time to convert data 0 seconds
Time to write data 1 seconds

and that is without any particular optimisations..


Granted, my system is a P4 3.0 GHz with one gig of RAM, but that still shouldn't account for that huge difference.

If you can hold the buffers in memory (in my case 67500 KB for the floating-point image, and 16875 KB for the 24-bit image), I think it should be faster.

Could you run the sample program on your computer and tell me what values you get? Or am I misunderstanding what you are trying to do?

Good Luck!
Willem

[edited by - thezensunni on June 8, 2004 5:26:18 PM]

Here were my specs:

Generate: 5
Read: 1
Determine Max/Min: 0
Convert: 8
Write: 1


Compared to what my real program is doing, those numbers are a wet dream. I'm running on a 667 MHz machine, and those numbers are ideal for what I'm wanting. Now if I can just figure out why the hell my program is taking so much longer.

I just checked the data count for my program: 4096 x 4096, around 16 million float values. Now how in the world can it possibly be taking so much longer? Man, this is frustrating!

How do 16,777,216 floats make a 4096*4096 image?

That would be one float for each pixel; is the result a grayscale image? Anyway, that doesn't really matter, since you would still have to convert only 16 million floats. Could you post the code of your inner loop? It just shouldn't be that slow.

This is all there is to it, unless I'm still mistaken about what you are trying to do (which is not THAT unlikely at all):

float fmin;
float fmax;

float diff = fmax - fmin;

so the whole range of 0 - 255 will be mapped onto diff

float delta = diff / 255.f;

for each float:

float fValue = (rawData[i] - fmin) / delta;
byteData[i] = (byte) fValue;


[edited by - thezensunni on June 9, 2004 5:05:54 AM]

I will post some of the actual code in a couple of hours (again, no time at the moment). One thing I've realized is that in the original program (before I came in and mangled it), the image loading code was put inside a thread (this is a Windows GUI application, by the way, built with Borland C++Builder). However, because the user can't interact with any part of the program until the entire image is loaded and converted anyway, I felt that using a thread was unnecessary and did away with it. Nothing needed to be done in parallel; the image MUST be loaded first and THEN the user can use it, so I felt that using a parallel structure such as a thread to perform a serial activity was meaningless.

BUT, using this old code, I went in and added the max/min and conversion code within the thread, and now the speed is lightning fast, as it should have been all along. I'm not an expert on how Borland works, so I'm not sure if somehow I got my program stuck in a small bind or what. During conversion, my CPU usage stays at 100% for the entire convert time, about 10 seconds. Within the thread code, it's nearly instantaneous. Anyway, I might just chalk this one up to weird compiler gibberish and go back to the original thread code, although I don't see why it would make a difference. Let me know if you all have any more thoughts. Thanks again.

Well, it's officially something wrong with my GUI and not the actual loading code. I wrote a console app that loaded the same file that my GUI loads, and it completed everything in about 5 seconds. I inserted the EXACT same code into the GUI program and ran it, and it took no less than 57 seconds. Something is horribly wrong with my GUI and I don't know what. Nevertheless, I don't have the time/patience to worry about figuring it out, since I know that by sticking it in a thread it will run in 5 seconds (or faster) like the console version. Unfortunately, this brings up a slightly different problem. After completely getting rid of the threads and doing it my way, I've discovered that I've changed the program so much that it is no simple task just reverting back to the old code. Borland is great for drag-and-drop programming in some respects, but in others it just straight pisses me off.

It might be a result of the input message queue becoming full. Because you are maxing out your main thread, the Windows messages get piled up in your program's message queue, and this can cause all kinds of trouble. Any lengthy processing, even 5 seconds or more, really, should be put in a second thread so that your main app can handle messages until you are done.
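In outline, the advice above looks like this. The sketch uses C++11 std::thread for portability rather than the Win32 CreateThread discussed later in the thread (anachronistic for 2004, but the shape is the same), and join() stands in for "pump messages until the worker signals completion":

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Run the float->byte conversion on a worker thread so the caller
// (a GUI message loop, say) stays free to handle messages.
// Names are illustrative, not from the original program.
std::vector<unsigned char> convertInBackground(const std::vector<float>& raw)
{
    std::vector<unsigned char> bytes(raw.size());
    std::thread worker([&] {
        for (std::size_t i = 0; i < raw.size(); ++i)
            bytes[i] = (unsigned char)(raw[i] * 255.0f);
    });
    worker.join();   // a real app would pump messages here, then join
    return bytes;
}
```

In a real GUI the worker would post a "done" notification back to the main thread instead of the main thread blocking in join().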

Well, maybe somebody can help me out with this really quick. It is only a syntax error, but after too many pointers start getting thrown into the mix I have a difficult time keeping everything straight. I'm having trouble setting the correct value for the lpParameter argument, as well as the receiving parameter for the thread function.


class TImage { .... };

int main()
{
    TImage *image = new TImage();

    StartThreads(image);
}


void StartThreads(TImage *image)
{
    HANDLE loadThread;
    loadThread = CreateThread(NULL, 0, ThreadLoadPic, (LPVOID)(&image), CREATE_SUSPENDED, NULL);

    SetThreadPriority(loadThread, THREAD_PRIORITY_HIGHEST);
    ResumeThread(loadThread);

    WaitForSingleObject(loadThread, INFINITE);

    CloseHandle(loadThread);
}

DWORD WINAPI ThreadLoadPic(LPVOID theImage)
{
    // read in file and set data in the original TImage object

    (TImage *)theImage->Size = "FILE_SIZE";
    (TImage *)theImage->FileName = "FILE_NAME";

    // etc etc etc
}


How should I be casting the lpParameter argument for the CreateThread function, and how should I be receiving and using it in the ThreadLoadPic function? Thanks.

[edited by - StonieJ on June 9, 2004 10:46:11 PM]

Don't raise the thread priority to highest just because you can, first off. The scheduler will allot enough time to your processing thread.

You want:
loadThread = CreateThread(NULL, 0, ThreadLoadPic, (LPVOID)image, CREATE_SUSPENDED, NULL);

Put the function ThreadLoadPic before StartThreads (or prototype the ThreadLoadPic function, though I prefer the idea that the definition is the declaration) so that StartThreads doesn't assume ThreadLoadPic is of type 'int ThreadLoadPic(void)' (or whatever the default prototype is).

Skizz

