derek7

vector is very slow

Recommended Posts

std::vector<DWORD> pixellist(1024 * 1024);
D3DLOCKED_RECT lockrect;

destsurface->LockRect(&lockrect, 0, 0);

DWORD* dword = (DWORD*)lockrect.pBits;

for (int i = 0; i < desc.Height; i++)
{
    for (int j = 0; j < desc.Width; j++)
    {
        int index = i * lockrect.Pitch / 4 + j;

        pixellist[index] = dword[index];
    }
}

pixellist[index] is accessed many times, and it is very slow. If I use a plain DWORD array instead, it is very fast. Where is the problem?

Well, if it's "very" slow, then chances are you're working in Debug mode, which means that std::vector::operator[] will be checking your bounds and doing many other things to ensure that you don't do something stupid, like access outside the bounds of the vector.

Why don't you use new to allocate the DWORDs and then use a series of memcpy() calls to fill it in row by row? If you're lucky, the compiler will inline the memcpy calls, because it can treat them as intrinsic functions.

On the plus side, you remove one loop, and I think the compiler will move up to 4 bytes at a time.

Edit: Also, if you align the starting address of your pixel list, you can gain a little more performance, because every write will be on an aligned address.

In general, the STL with optimizations off (especially inlining), as in debug mode, carries considerable overhead, and your program will run really, really slowly even doing the simplest things. Turn optimizations on and most of the overhead goes away.

Quote:
Original post by Washu
std::vector::operator[] will be checking your bounds and doing many other things to ensure that you don't do something stupid, like access outside of the bounds of the vector.


Depends on the implementation. The C++ standard only requires vector<T>::at() to do bounds checking, not vector<T>::operator[]. Doing v[i] with i >= v.size() is undefined behavior.

While some STL implementations just make [] the same function as .at(), others cut corners and don't, in which case [] is just as efficient as an unchecked C array.

It would be a pretty lame implementation if it didn't do bounds checking in operator[] in debug mode. And since the implementations that come with both VC++ and GCC do it, it's pretty safe to assume his does too.

Guest Anonymous Poster
It's clearly the division you've got inside your index calculation. Move it outside of your loops; it's a complete waste of CPU cycles to calculate it over and over again.

Quote:
Original post by Anonymous Poster
It's clearly the division you've got inside your index calculation. Move it outside of your loops; it's a complete waste of CPU cycles to calculate it over and over again.


A division by 4 is likely to be optimized to a bitshift, which is unlikely to be a great cause of performance loss (one cycle for every iteration).

Quote:
Original post by Thelo
Quote:
Original post by Anonymous Poster
It's clearly the division you've got inside your index calculation. Move it outside of your loops; it's a complete waste of CPU cycles to calculate it over and over again.


A division by 4 is likely to be optimized to a bitshift, which is unlikely to be a great cause of performance loss (one cycle for every iteration).
Calculating something that is loop invariant more than a single time would be a waste of CPU time. Using a compiler that doesn't optimize such things and/or optimizing it manually would be a waste of programmer time. Trying to optimize without profiling would also be a complete waste of programmer time, and considering such details as to whether it's implemented as a multiply, a shift, a load effective address, or anything else is a worse waste of programmer time.

Is that run every frame? In that case, you might want to make the vector static; otherwise you will allocate new memory every time the function is run.

I've done some benchmarks using similar code to the above, and the numbers are about in these proportions:

Using the original method: 460~480
Putting the vector static: 38~46
Using a pure array: 26~28

Putting the division inside the loop or outside it doesn't affect anything.

Since this is a very large list, since this code snippet will probably be used pretty often and could be a performance bottleneck (as implied by the names referring to pixels and surfaces), and since the size of the vector/array is known at creation and fixed for the duration of the operation (again, it's a list of the pixels of a given surface), I'd simply recommend using a plain array in this situation.

Quote:
Original post by Thelo
I've done some benchmarks using similar code to the above, and the numbers are about in these proportions:

Using the original method: 460~480
Putting the vector static: 38~46
Using a pure array: 26~28

Putting the division inside the loop or outside it doesn't affect anything.

Since this is a very large list, since this code snippet will probably be used pretty often and could be a performance bottleneck (as implied by the names referring to pixels and surfaces), and since the size of the vector/array is known at creation and fixed for the duration of the operation (again, it's a list of the pixels of a given surface), I'd simply recommend using a plain array in this situation.


Can we see your test code, please? I want to try it. I'm curious as to how you got such radically different results, seeing as here the results aren't off by much, and I managed to get almost equal values by ramping up optimisations.

Guest Anonymous Poster
Quote:
Original post by rip-off
Can we see your test code please, I want to try it. I'm curious as to how you got such radically different results, seeing as here the results aren't off by much, and I managed to get almost equal values by ramping up optimisations.
One difference between the vector and a dynamic array is that the vector as shown here value-initializes its elements to 0, while a dynamic array doesn't. Since the following loop is a simple operation, essentially a memcpy, the compulsory memset-like pass for vectors might affect the performance that much.

Guest Anonymous Poster
Quote:
Original post by Anonymous Poster
Quote:
Original post by rip-off
Can we see your test code please, I want to try it. I'm curious as to how you got such radically different results, seeing as here the results aren't off by much, and I managed to get almost equal values by ramping up optimisations.
One difference between the vector and a dynamic array is that the vector as shown here value-initializes its elements to 0, while a dynamic array doesn't. Since the following loop is a simple operation, essentially a memcpy, the compulsory memset-like pass for vectors might affect the performance that much.
Oh wait, I was comparing the numbers for "static vector" and "pure array".. memset definitely shouldn't take 20 times as long as memcpy ;).

Anyway, another difference is that stack allocation is faster than the heap allocation that vector must use. But it definitely shouldn't be by that much.

Quote:
Original post by Anonymous Poster
Quote:
Original post by Anonymous Poster
Quote:
Original post by rip-off
Can we see your test code please, I want to try it. I'm curious as to how you got such radically different results, seeing as here the results aren't off by much, and I managed to get almost equal values by ramping up optimisations.
One difference between the vector and a dynamic array is that the vector as shown here value-initializes its elements to 0, while a dynamic array doesn't. Since the following loop is a simple operation, essentially a memcpy, the compulsory memset-like pass for vectors might affect the performance that much.
Oh wait, I was comparing the numbers for "static vector" and "pure array".. memset definitely shouldn't take 20 times as long as memcpy ;).

Anyway, another difference is that stack-allocation is faster than the heap allocation that vector must use. But definitely shouldn't be by that much


Hence my curiosity :)

Quote:
Original post by rip-off
Can we see your test code please, I want to try it. I'm curious as to how you got such radically different results, seeing as here the results aren't off by much, and I managed to get almost equal values by ramping up optimisations.

Sure. Here's what I used:

typedef struct {
    int foo, bar, baz;
    int Pitch;
} tLockrect;

DWORD pixellist2[1024 * 1024];

void TestFunction() {
    std::vector<DWORD> pixellist(1024 * 1024);
    tLockrect lockrect;
    lockrect.Pitch = 24;
    DWORD dword[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    int denom = lockrect.Pitch / 4; // hoisted divide (unused below)
    for (int i = 0; i < 640; i++) {
        for (int j = 0; j < 480; j++) {
            int index = i * lockrect.Pitch / 4 + j;
            pixellist[index] = dword[index % 8];
        }
    }
}

void TestFunction2() {
    static std::vector<DWORD> pixellist(1024 * 1024);
    tLockrect lockrect;
    lockrect.Pitch = 24;
    DWORD dword[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    int denom = lockrect.Pitch / 4; // hoisted divide (unused below)
    for (int i = 0; i < 640; i++) {
        for (int j = 0; j < 480; j++) {
            int index = i * lockrect.Pitch / 4 + j;
            pixellist[index] = dword[index % 8];
        }
    }
}

void TestFunction3() {
    tLockrect lockrect;
    lockrect.Pitch = 24;
    DWORD dword[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    for (int i = 0; i < 640; i++) {
        for (int j = 0; j < 480; j++) {
            int index = i * lockrect.Pitch / 4 + j;
            pixellist2[index] = dword[index % 8];
        }
    }
}




Then, to actually test these out (I put this in a sandbox Win32 application, hence the SetDlgItemText line)

int i;
DWORD startTime = timeGetTime();
for (i = 0; i < 50; i++) {
    TestFunction();
}
DWORD endTime = timeGetTime();
SetDlgItemText(dialogHwnd, IDC_EDIT1, int2str(endTime - startTime).c_str());




The whole thing was compiled with Visual Studio 2003, in release mode, without managed extensions. I used a global array instead of a local one to prevent the compiler from optimizing the whole operation away.

Guest Anonymous Poster
If it's always 1024*1024, you could even use a stack C array and increase the stack size in your compiler settings.

Quote:
Original post by Extrarius
Quote:
Original post by Thelo
Quote:
Original post by Anonymous Poster
It's clearly the division you've got inside your index calculation. Move it outside of your loops; it's a complete waste of CPU cycles to calculate it over and over again.


A division by 4 is likely to be optimized to a bitshift, which is unlikely to be a great cause of performance loss (one cycle for every iteration).
Calculating something that is loop invariant more than a single time would be a waste of CPU time. Using a compiler that doesn't optimize such things and/or optimizing it manually would be a waste of programmer time. Trying to optimize without profiling would also be a complete waste of programmer time, and considering such details as to whether it's implemented as a multiply, a shift, a load effective address, or anything else is a worse waste of programmer time.


While I'm totally with you that optimization should be done after a profiling session, I believe it's good practice to move calculations that are not affected by the loop out of it. It makes the code clearer, in my opinion. For the same reason I'd rather write ++itor instead of itor++ up front than optimize it after finding out I'm wasting too many cycles iterating through a list; that's the kind of thing that's really easy to do right the first time.

Did any of you optimizers consider the possibility that the copied data doesn't fill the entire container (but only a sub-rectangle), and that the other values do need to be zeroed out?

Anyway, the loops can be replaced with:


for (int i = 0; i < desc.Height; ++i) {
    int index = i * lockrect.Pitch / 4;
    std::copy(dword + index, dword + index + desc.Width, pixellist.begin() + index);
}


For DWORDs, the std::copy call is likely to be transformed into memcpy or memmove behind the scenes.

Of course, if you feel like doing the compiler's job for it, or showing off, you *could* manage the loop invariants yourself:


// Note: I'm not at all sure I got this right, which is why you shouldn't do these things.
const int stride = lockrect.Pitch / 4; // I assume the Pitch is divisible by 4
DWORD* last = desc.Height * stride + dword;
for (DWORD *i = &pixellist[0], *j = dword; j < last; i += stride, j += stride) {
    std::copy(j, j + desc.Width, i);
}

Here is my test. I added and changed some things; I hope you don't mind, they shouldn't affect the test.


struct tLockrect
{
    int foo, bar, baz;
    int Pitch;
};

// made it global to make the functions easier
DWORD dword[8] = {8, 7, 6, 5, 4, 3, 2, 1};

// normal vector test
void TestFunction1()
{
    std::vector<DWORD> pixellist1(640 * 480);

    tLockrect lockrect;
    lockrect.Pitch = 24;

    for (int i = 0; i < 640; i++) {
        for (int j = 0; j < 480; j++) {
            int index = i * lockrect.Pitch / 4 + j;
            pixellist1[index] = dword[index % 8];
        }
    }
}

// static vector test
void TestFunction2()
{
    static std::vector<DWORD> staticpixellist(640 * 480);
    tLockrect lockrect;
    lockrect.Pitch = 24;

    for (int i = 0; i < 640; i++) {
        for (int j = 0; j < 480; j++) {
            int index = i * lockrect.Pitch / 4 + j;
            staticpixellist[index] = dword[index % 8];
        }
    }
}

// normal array test
void TestFunction3()
{
    DWORD pixellist2[640 * 480];
    tLockrect lockrect;
    lockrect.Pitch = 24;

    for (int i = 0; i < 640; i++) {
        for (int j = 0; j < 480; j++) {
            int index = i * lockrect.Pitch / 4 + j;
            pixellist2[index] = dword[index % 8];
        }
    }
}

// dynamic array test
void TestFunction4()
{
    DWORD* pixellist4 = new DWORD[640 * 480];

    tLockrect lockrect;
    lockrect.Pitch = 24;

    for (int i = 0; i < 640; i++) {
        for (int j = 0; j < 480; j++) {
            int index = i * lockrect.Pitch / 4 + j;
            pixellist4[index] = dword[index % 8];
        }
    }

    delete[] pixellist4; // must be delete[], not delete, for arrays
}

// global array test
DWORD pixellist5[640 * 480];

void TestFunction5()
{
    tLockrect lockrect;
    lockrect.Pitch = 24;

    for (int i = 0; i < 640; i++) {
        for (int j = 0; j < 480; j++) {
            int index = i * lockrect.Pitch / 4 + j;
            pixellist5[index] = dword[index % 8];
        }
    }
}

// global vector test
std::vector<DWORD> pixellist6(640 * 480);

void TestFunction6()
{
    tLockrect lockrect;
    lockrect.Pitch = 24;

    for (int i = 0; i < 640; i++) {
        for (int j = 0; j < 480; j++) {
            int index = i * lockrect.Pitch / 4 + j;
            pixellist6[index] = dword[index % 8];
        }
    }
}

#define test(x) { \
    startTime = SDL_GetTicks(); \
    for (int i = 0; i < 50; ++i) \
    { \
        TestFunction##x(); \
    } \
    t##x##time += SDL_GetTicks() - startTime; \
}


void runTests()
{
    DWORD t1time = 0, t2time = 0, t3time = 0, t4time = 0, t5time = 0, t6time = 0;
    for (int j = 0; j < 50; ++j)
    {
        DWORD startTime;
        test(1);
        test(2);
        test(3);
        test(4);
        test(5);
        test(6);
    }
    std::cout << "time for test1 was " << t1time << std::endl;
    std::cout << "time for test2 was " << t2time << std::endl;
    std::cout << "time for test3 was " << t3time << std::endl;
    std::cout << "time for test4 was " << t4time << std::endl;
    std::cout << "time for test5 was " << t5time << std::endl;
    std::cout << "time for test6 was " << t6time << std::endl;
}






Some sample results:


time for test1 was 12950
time for test2 was 2490
time for test3 was 2470
time for test4 was 9873
time for test5 was 2887
time for test6 was 2904

time for test1 was 12974
time for test2 was 2489
time for test3 was 2474
time for test4 was 9885
time for test5 was 2892
time for test6 was 2870

time for test1 was 12927
time for test2 was 2481
time for test3 was 2492
time for test4 was 9811
time for test5 was 2899
time for test6 was 2902

time for test1 was 12955
time for test2 was 2483
time for test3 was 2510
time for test4 was 9869
time for test5 was 2896
time for test6 was 2898

time for test1 was 12926
time for test2 was 2483
time for test3 was 2481
time for test4 was 9828
time for test5 was 2881
time for test6 was 2869






However, there is the issue of scope... Only the functions here that use globals actually produce output. You cannot return stack variables, so here I think passing a vector reference would be the "fastest" way of doing something actually useful. A static array of known size is very limiting; why not use a vector instead, to give you more options later on?

After all, this test shows that the main speed problems were caused by the obtaining, the releasing and (not demonstrated, but probably) the zeroing of the memory.

Also, in your test you over-allocated the vector compared to the amount of space used. The vector's ctor must zero that extra space, so that biases your results. And making the array a global instead of a stack variable also changes the results.

Just a quick note for those interested:

In the latest STL implementation from MS in VC++ 2005, bounds checking is done even in release builds by default. To disable this behavior, you must #define _SECURE_SCL to be 0 before including any standard library headers.
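Assuming the macro works as described, the define would have to appear before any standard header, e.g.:

```cpp
// Must come before any standard library include (or go in the
// project-wide preprocessor definitions instead).
#define _SECURE_SCL 0 // VC++ 2005: turn off checked iterators in release builds
#include <vector>
```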


#include <windows.h>
#include <sstream>
#include <string>
#include <vector>
#pragma comment(lib, "winmm")

struct struct1
{
    int i;
    float f;
    char c;
};

int main()
{
    DWORD pretime = timeGetTime();
    for (int i = 0; i < 100000; i++)
    {
        // Only one of the next two lines can be active at a time:
        std::vector<int>* v = new std::vector<int>(100); // it takes only 700 ms (never deleted!)
        // std::vector<int> v(100);                      // it takes 900 ms
    }
    DWORD deltatime = timeGetTime() - pretime;
    std::stringstream c;
    std::string s;
    c << deltatime;
    s = c.str();
    ::MessageBox(0, s.c_str(), 0, 0);
    return 0;
}



Look at this. Why is the dynamic allocation faster? It was run under a debug build.

I am a bit confused. Shouldn't the non-dynamically-allocated vector be allocated once, while the dynamically allocated vector is allocated 100000 times?

Quote:

look at this , why dynamic allocate fast? it run under debug build

You didn't issue a 'delete' in there. Meanwhile, the stack version pushes and pops on every loop iteration.

Quote:
Original post by penwan
Just a quick note for those interested:

In the latest STL implementation from MS in VC++ 2005, bounds checking is done even in release builds by default. To disable this behavior, you must #define _SECURE_SCL to be 0 before including any standard library headers.


I don't know what to say to this.

