2D Line drawing algo

24 comments, last by jho 23 years, 5 months ago

Example of optimization (sorry to post twice but I can't get the post to work with two [ source ] statements)

    do {
        *(pDst + (yaccum >> 16) + (xaccum >> 16)) = color;
        xaccum += xinc;
        yaccum += yinc;
        *(pDst + (yaccum2 >> 16) + (xaccum2 >> 16)) = color;
        xaccum2 += xinc;
        yaccum2 += yinc;
    } while( --nLineLengthDiv2 );


...or something like this to fill out the pipeline. There will be only one missed branch prediction per line drawn using this method. It also assumes every line drawn is at least a dot.

Branch prediction in Bresenham's will still (on average, with random line orientations) get it wrong 25% of the time (worst case just under 50%, best case just over 0%).
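For context, the fixed-point accumulator idea above can be sketched as a complete function. This is a hedged sketch, not code from the thread: the function name, the `pitch` parameter, and the 16.16 format are my own assumptions, and the loop is the straightforward single-pixel version rather than the unrolled two-pixel one.

```c
#include <stdint.h>
#include <stdlib.h>

/* Sketch of a 16.16 fixed-point (DDA) line: one predictable loop branch,
   no per-pixel octant test.  Names and layout are illustrative. */
static void draw_line_dda(uint32_t *dst, int pitch,
                          int x0, int y0, int x1, int y1, uint32_t color)
{
    int dx = x1 - x0, dy = y1 - y0;
    int adx = abs(dx), ady = abs(dy);
    int steps = (adx > ady) ? adx : ady;     /* step along the major axis */

    if (steps == 0) {                        /* degenerate line: a dot */
        dst[y0 * pitch + x0] = color;
        return;
    }

    /* 16.16 fixed-point accumulators and per-step increments */
    int32_t xinc = (int32_t)(((int64_t)dx << 16) / steps);
    int32_t yinc = (int32_t)(((int64_t)dy << 16) / steps);
    int32_t xacc = (x0 << 16) + 0x8000;      /* +0.5 for rounding */
    int32_t yacc = (y0 << 16) + 0x8000;

    for (int i = 0; i <= steps; i++) {       /* one predictable branch */
        dst[(yacc >> 16) * pitch + (xacc >> 16)] = color;
        xacc += xinc;
        yacc += yinc;
    }
}
```

Unrolling the body two pixels per iteration, as in the snippet above, halves the loop overhead at the cost of requiring an even pixel count.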

quote:Original post by neocron

You could use floats instead (so you don't have to do the two right shifts) but remember to use an optimized cast and NOT just (int) or floor() to get the integer value of the screen (x,y) coordinate.


'Optimized cast?'

-Ironblayde
 Aeon Software

The following sentence is true.
The preceding sentence is false.
"Your superior intellect is no match for our puny weapons!"

Optimized cast: just a faster way to get, for example, an int from a float. Here's some code:

    inline int FloatToInt( float a )
    {
        int b;
        _asm
        {
            fld a;
            fistp b;
        }
        return b;
    }
Only read the first few posts in detail, but what about getting rid of some comparisons using run lengths of pixels in the middle...
For instance, if there are run lengths of 3 then 4 pixels in the middle of the line (middle being everywhere but the end points), then always draw 3 pixels, then test the last one, until you reach the end. (The start and end could have a run length different from 3 or 4 pixels in this case... it just depends...)
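The run-length idea above can be sketched as follows. This is my own minimal illustration (the function name and the midpoint-rounding formula are assumptions), showing only the run decomposition for an x-major line, not the fully optimized draw-3-then-test version:

```c
#include <stdint.h>

/* Sketch of the run-length idea for an x-major line (assumes dx >= dy > 0):
   each scanline gets one horizontal run of nearly equal length, so the
   per-pixel decision is replaced by one end-of-run computation per row. */
static void draw_line_runs(uint32_t *dst, int pitch,
                           int x0, int y0, int x1, int y1, uint32_t color)
{
    int dx = x1 - x0, dy = y1 - y0;
    int x = x0;
    for (int y = y0; y <= y1; y++) {
        /* last x on this scanline, by rounding the exact intersection */
        int xend = (y == y1) ? x1
                             : x0 + ((2 * (y - y0) + 1) * dx) / (2 * dy);
        for (int xi = x; xi <= xend; xi++)   /* draw the whole run */
            dst[y * pitch + xi] = color;
        x = xend + 1;
    }
}
```

With the runs known to be either N or N+1 pixels long, the inner loop can be unrolled to N unconditional stores plus one test, which is the scheme described above.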

quote:Original post by SiCrane

Also, you really do not want to do lookup tables for line drawing. The cache miss costs of the lookup table colliding with the actual surface buffer would not be worth the few (if any) actual processor cycles saved.


Isn't it true that the cache used by the relatively small lookup table (screen width 1024 * int = 4096 bytes) is just a small impact on the total amount of cache today?
Since the lookup table is used for every line, it should be in the cache after the first few lines. I don't see any cache misses after that...

What am I missing here? Did I fail my cache course?

Ironblayde: If you already know the basics of how a processor works (i.e. stalls, forwarding, pipelining, etc.) basically just read the technical specifications and commentaries.

baskuenen: Maybe I'm not understanding your lookup table, but with only 1024 entries, it seems that you can only store the reciprocals of the run length, which means you still have to do a multiply instruction in order to get the increment value. So the combination of the load with the multiply would seem to be worse than a straight divide, even if there is no cache miss. As I originally understood it, it sounded like you wanted a lookup table the size of the screen (1024 x 768), which obviously would have major cache problems.

But let's assume that I'm just misunderstanding your lookup table and that it works efficiently with regard to best-case instruction time. Take a PIII Katmai: it has a 16 KB 4-way set associative L1 data cache. If you load a 4 KB lookup table, you've taken up a cache line in each and every set. If we have a 32-bit framebuffer at 1024 x 768, then each horizontal scanline will similarly consume a cache line in each and every set. Because of the structure of a frame buffer, we know that all the pixels along a vertical line fall in the same cache set.

Now let's draw a vertical line straight from the top of the screen to the bottom. We've just loaded 768 different cache lines in and out of one set in the L1 cache. Not only that, but the L2 cache on a PIII Katmai is only 256 KB and 8-way associative. So whatever was flushed from the L1 cache is no longer sitting in the L2 cache either. When you try to use any part of the lookup table that sat in that set, you'll have to do a fetch from main memory, which takes about 115 clock cycles.

Not only will purely vertical lines do this, but so will many near-vertical lines, as horizontally adjacent pixels also tend to fall in the same cache line. Many lines with a slope somewhere between vertical and 1 will cause entire cache sets to be flushed, which induces the less costly L1 cache miss on table lookup, still on the order of 10 additional cycles. And after three or four near-horizontal line calls, it seems likely that previously unused elements of the lookup table will also be flushed from the L1 cache. So it seems likely that for random lines, you'll almost always have to load the lookup table value from the L2 cache or worse.

Also, things are more grim if you consider the P4, which only has an 8 KB L1 cache, or the PII, where the cache miss penalties are much higher.
quote:Original post by SiCrane

baskuenen: Maybe I'm not understanding your lookup table, but with only 1024 entries, it seems that you can only store the reciprocals of the run length [...] So it seems likely that for random lines, you'll almost always have to load the lookup table value from the L2 cache or worse.


SiCrane, thanks for your explanation. I still cannot agree with you.

First, I'm assuming we're reading the pixel before writing - else it won't be loaded into cache.
Second, for a maximum-length vertical line, you have 768 pixels. Every cache line is 32 bytes, and 32 * 768 = 24K of cache needed. You're right when we have a small cache. (I assumed we had at least 128K, like the "old" Pentiums.) Isn't cache getting smarter nowadays, like overwriting the least-used memory? I'm not sure about this though.
Third, using a 1024x768 lookup table is a bit of overkill. I would suggest a lookup table containing the values of 1/DeltaWidth or 1/DeltaHeight. And yes, we would need an extra multiply to put this value to use (dx/dy or dy/dx).

Just some code - not really important
    FLD   LookupTbl[eax*4]  ; eax=dx           ; 1 cycle
    FMUL  [dy]                                 ; 1 cycle
    ; do some integer thing                    ; 1 cycle loss due to FISTP
    FISTP [result]                             ; 3 cycles (or was it 5?)
    MOV   eax, [result]     ; paired instruction, 1/2 cycle for loading into register again
    ; total of 6.5 CPU cycles needed for calculation
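The same table-plus-multiply calculation can be sketched in portable C. This is a hedged sketch of the idea only: the table size, the 16.16 fixed-point format, and the function names are my own assumptions, not code from the thread:

```c
#include <stdint.h>

#define MAX_DELTA 1024   /* one reciprocal per possible dx, 4 KB total */

/* Reciprocal table: 1/delta in 16.16 fixed point.  Computing a slope then
   costs one table load plus one multiply instead of a divide. */
static uint32_t recip_tbl[MAX_DELTA];

static void init_recip_table(void)
{
    recip_tbl[0] = 0;                        /* unused; avoid div by zero */
    for (int d = 1; d < MAX_DELTA; d++)
        recip_tbl[d] = (1u << 16) / (uint32_t)d;
}

/* dy/dx as a 16.16 fixed-point slope: one load + one multiply */
static uint32_t slope_fixed(int dy, int dx)
{
    return (uint32_t)dy * recip_tbl[dx];
}
```

Whether the load+multiply actually beats a divide is exactly the cache question argued above: it only wins if the table stays resident.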



If the cache is only 8K in size, using it for drawing (including read before write) would make things really slow.
If we don't use the cache for writing to the screen, the bit of cache lost to our lookup table is acceptable.

Am I wrong?

Is it true that the size of the cache is decreasing in newer computers? Why? To make it a small bit faster, or because of the cost?

Some last words:
One thing is for sure: if your lines are very short, Bresenham is your solution!

The L1 cache has always been minuscule. The L2 cache is what you're thinking of when you say the old Pentiums had 128 KB. However, what's really relevant is the cache organization. The L1 cache on a PIII Katmai core is 4-way set associative, which means that any given memory location is limited to one of four cache lines. And while cache lines are flushed in an approximation of a least-recently-used algorithm, if a vertical line is drawn, it will try to load all 768 pixels into the same 4 cache lines. One of the cache lines may be preserved by stack accesses in the best case; however, the cache line that referenced the lookup table will most definitely be flushed from the L1 cache, even if that cache line held the currently used denominator, because the value actually used is located on the stack, not read directly from the lookup table.

The L2 cache on the Katmai is 8-way associative, and again the pixels along the vertical line share the same mapping, so all 768 will try to load into the same 8 cache lines. Again, one line will probably be saved from being flushed by the stack; however, the line that held the lookup table will be flushed as 768 cache lines compete for the remaining slots. If either cache were fully associative, which is what you seem to be assuming, then we wouldn't have this problem, but fully associative caches are *slow*. (Incidentally, this associativity problem is why the last batch of Celerons performs so poorly. They have half the L2 cache of the equivalent PIIIs, and it's only 2-way associative.)
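The set-mapping argument can be checked with a few lines of C. Cache parameters are the ones given in the post (16 KB, 4-way, and a 32-byte line is the usual Katmai figure, so 16384 / (32 * 4) = 128 sets); the helper name is my own:

```c
#include <stdint.h>

/* PIII Katmai L1 D-cache: 16 KB, 4-way, 32-byte lines => 128 sets.
   A 1024x768 32bpp framebuffer has a 4096-byte row pitch, which is an
   exact multiple of 128 sets * 32 bytes, so every pixel of a vertical
   line maps to the same set and competes for the same 4 ways. */
enum { LINE_SIZE = 32, NUM_SETS = 128, ROW_PITCH = 1024 * 4 };

static unsigned cache_set(uintptr_t addr)
{
    return (unsigned)((addr / LINE_SIZE) % NUM_SETS);
}
```

For any address `a`, `cache_set(a) == cache_set(a + ROW_PITCH)`, which is why a vertical line cycles 768 cache lines through one 4-entry set.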

And it's not like you can stop the cache from being loaded on a write access to the framebuffer. While the PIII core uses a nice write-back scheme, which gives write instructions essentially no overhead (an oversimplification), the cache is still loaded with the written line for coherence. (I'm assuming the Pentium 4 will be similar, but it's not like I've gotten the chance to play with one yet.)

And cache sizes on the whole have been going up over the years, as the price and performance of transistors have improved. However, the Pentium 4 engineers decided that they needed to decrease L1 cache latency, so the L1 data cache on the P4 has been dropped to 8 KB, which cuts the access time from 3 cycles to 2. Considering how much time x86 programs spend moving data to and from local variables on the stack, this isn't necessarily a bad thing. However, AMD's engineers didn't have the same opinion, so the T-Bird has a nice happy 64 KB L1 cache.

Ok - you got me here. This is going a little over my head, and that doesn't happen often.

I stopped coding in asm around the time the PII arrived, and my knowledge of caches goes about that far. I had a class a year ago called "Computer Architecture" that covered the stuff you're talking about, but I don't remember any of it anymore. Maybe I should have paid more attention?

Anyway, SiCrane: where did you get that knowledge? Is there a nice practical article somewhere where I can upgrade my knowledge of this? I don't see myself doing assembly anymore, but it helps me make good C decisions. I don't find it interesting enough to read loads about it, but a nice small practical article would be fun.

Who are you, SiCrane? It doesn't often happen that people know more than me! (I hope) You've got my respect here.

(to jho: So now you know how to draw a line! NOT! Sorry about this)



Edited by - BasKuenen on November 2, 2000 6:21:15 PM

I'm not sure how a lookup table will help in line drawing. Could someone give me an example?

