optimization

Started by
1 comment, last by quasar3d 20 years, 2 months ago
Hello, I've read a lot about optimization, and especially about pairing, so I've written a little test program to measure the real effect of this method.

With pairing:

	cli
		mov ecx,1000000000

	lp:
		mov eax,5
		add ebx,1
		sub edx,100
		inc eax
		dec ebx

		mov edx,5
		add eax,1
		sub ebx,100
		inc edx
		dec eax

		mov ebx,5
		add edx,1
		sub eax,100
		inc ebx
		dec edx

		dec ecx
		jnz lp


		sti

I've also tried this loop with every instruction working on the same register:

		mov eax,5
		add eax,1
		sub eax,100
		inc eax
		dec eax
In theory you would expect that ALL the instructions can be paired, so the optimized code should be a lot faster. In my test program, which reads the time with the rdtsc instruction, the performance gain is only 3%. The test machine is a Celeron 433 with Win98. Why is the code not optimized as much as would be expected?
> with Win98
> cli
heh, Win9x is useful in some respects

Try getting rid of the movs to eax, edx, and ebx. You are effectively cutting short your 'dependency chains' by overwriting the registers' values. To understand these effects, you need to get out of the mindset that the CPU just executes instructions as they come, with a fixed latency per op. Your use of the word 'pairing' indicates you've been reading Pentium optimization docs - those are long out of date.

Once the processor 'sees' the mov, it is allowed to discard all previous writes to this reg, as long as no other instructions depend on them. I have not dealt with PIIs in depth, so I don't know how large the window is, nor how clever its data flow analysis is.
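To make the 'dependency chain' point concrete, here is a rough C sketch, not a model of any real CPU. It assumes every instruction takes one clock and that there are unlimited execution units, so only data dependencies matter; flags dependencies are ignored. The names `Insn`, `chain_depth`, `with_mov` and `without_mov` are made up for illustration. It measures the longest chain of dependent instructions for the single-register loop body, with and without the leading mov.

```c
#include <assert.h>

/* Toy model (assumption): latency 1 per instruction, unlimited
   execution units, flags ignored. Only register data flow counts. */
enum { EAX, EBX, EDX, NREGS };

typedef struct {
    int dst;        /* destination register */
    int reads_dst;  /* 1 for add/sub/inc/dec, 0 for mov reg,imm */
} Insn;

/* mov eax,5 / add eax,1 / sub eax,100 / inc eax / dec eax */
static const Insn with_mov[] = {
    {EAX, 0}, {EAX, 1}, {EAX, 1}, {EAX, 1}, {EAX, 1}
};

/* Same body with the mov removed. */
static const Insn without_mov[] = {
    {EAX, 1}, {EAX, 1}, {EAX, 1}, {EAX, 1}
};

/* Longest chain of dependent instructions after `iters` repetitions
   of a body of `n` instructions. */
static int chain_depth(const Insn *body, int n, int iters)
{
    int ready[NREGS] = {0, 0, 0};  /* clock at which each reg's value is ready */
    int longest = 0;

    for (int it = 0; it < iters; ++it) {
        for (int i = 0; i < n; ++i) {
            int r = body[i].dst;
            assert(r >= 0 && r < NREGS);
            /* A mov from an immediate does not wait for the old value. */
            int start = body[i].reads_dst ? ready[r] : 0;
            ready[r] = start + 1;
            if (ready[r] > longest)
                longest = ready[r];
        }
    }
    return longest;
}
```

In this model `chain_depth(with_mov, 5, 1000)` stays at 5 no matter how many iterations run, because each mov starts a fresh chain, while `chain_depth(without_mov, 4, 1000)` grows to 4000. With the movs in place both test loops are limited by how many instructions the CPU can execute per clock rather than by latency, which would fit the small difference that was measured.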

BTW, registers aren't 'fixed' - there are many more hidden registers that are transparently renamed. Example: add ebx, 1 translates to:
1) take value of architectural reg ebx (more on that later)
2) add 1
3) put it in another temp register 'x' next clock
4) update mapping of arch regs - the current value of 'ebx' is in 'x'.
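The four renaming steps can be sketched in C as a small table that maps architectural registers to hidden ones. This is only an illustration under simplifying assumptions: the pool size `NUM_PHYS`, the struct `RenameState` and all function names are hypothetical, hidden registers are never recycled, and real hardware does this in parallel for several instructions per clock.

```c
#include <assert.h>

enum { EAX, EBX, ECX, EDX, NUM_ARCH };  /* architectural registers */
enum { NUM_PHYS = 16 };                 /* hypothetical hidden register pool */

typedef struct {
    int map[NUM_ARCH];    /* arch reg -> hidden reg that holds its current value */
    int value[NUM_PHYS];  /* contents of each hidden register */
    int next_free;        /* next hidden reg to hand out (never recycled here) */
} RenameState;

static void rename_init(RenameState *s)
{
    for (int r = 0; r < NUM_ARCH; ++r) {
        s->map[r] = r;     /* start with an identity mapping */
        s->value[r] = 0;
    }
    s->next_free = NUM_ARCH;
}

/* Steps 3 and 4: put the result in a fresh hidden register 'x'
   and update the mapping so that `arch` now refers to 'x'. */
static int rename_write(RenameState *s, int arch, int result)
{
    assert(arch >= 0 && arch < NUM_ARCH);
    assert(s->next_free < NUM_PHYS);
    int x = s->next_free++;
    s->value[x] = result;
    s->map[arch] = x;
    return x;
}

/* add reg, imm: step 1 reads the current mapping, step 2 adds. */
static int rename_add(RenameState *s, int arch, int imm)
{
    assert(arch >= 0 && arch < NUM_ARCH);
    int old = s->value[s->map[arch]];
    return rename_write(s, arch, old + imm);
}

/* mov reg, imm: the old value is never read, so there is no dependency. */
static int rename_mov(RenameState *s, int arch, int imm)
{
    return rename_write(s, arch, imm);
}

/* Current architectural value of a register. */
static int arch_value(const RenameState *s, int arch)
{
    assert(arch >= 0 && arch < NUM_ARCH);
    return s->value[s->map[arch]];
}
```

After `rename_add(&s, EBX, 1)` followed by `rename_mov(&s, EBX, 5)`, the two results sit in different hidden registers, so the mov never has to wait for the add to finish.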

Have a look at the code with your favorite profiler (not just some cheap timer thingy - something that does pipeline simulation), that should clear things up
E8 17 00 42 CE DC D2 DC E4 EA C4 40 CA DA C2 D8 CC 40 CA D0 E8 40E0 CA CA 96 5B B0 16 50 D7 D4 02 B2 02 86 E2 CD 21 58 48 79 F2 C3
Try aligning the loop label to a 16 byte boundary - this can make a difference on some machines.
    .align 16
lp:
    mov eax, 5
    ...
    dec ecx
    jnz lp
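For reference, here is what an align-to-16 directive works out to, as a small C sketch. The function name `align_padding` and the example addresses are made up; the assembler does this arithmetic itself and fills the gap with padding bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Number of padding bytes needed so that `addr` moves up to the next
   multiple of `alignment`. `alignment` must be a power of two. */
static uint32_t align_padding(uint32_t addr, uint32_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    uint32_t mask = alignment - 1;
    /* Distance to the next boundary, or 0 if already aligned. */
    return (alignment - (addr & mask)) & mask;
}
```

If the loop label would otherwise land at address 0x401003, `align_padding(0x401003, 16)` gives 13, so the label ends up at 0x401010.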
