Archived

This topic is now archived and is closed to further replies.

2D Line drawing algo

jho    122
I'm looking for a 2D line drawing algorithm. Is Bresenham's line algorithm the fastest? Also, I was wondering how I can draw a line thicker than one pixel.

baskuenen    124
Yep - Bresenham is the fastest.

For thicker lines - why don't you plot an extra pixel just above/below/left/right of each original Bresenham pixel?
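As an illustration (not code from the original post), a minimal sketch of that idea for an x-major line, assuming a PlotPixel(x, y, color) helper exists:

extern void PlotPixel(int x, int y, unsigned int color);   /* assumed helper */

/* Hypothetical sketch: thicken an x-major line by also plotting the
   pixel one row below each Bresenham pixel. For y-major lines you
   would plot the neighbour one column to the side instead. */
void PlotThick(int x, int y, unsigned int color)
{
    PlotPixel(x, y,     color);   /* the original Bresenham pixel */
    PlotPixel(x, y + 1, color);   /* its neighbour, for a 2-pixel-wide line */
}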

Ironblayde    130
The original Bresenham's algorithm runs a bit faster if you use a symmetric implementation -- that is, start at both ends and work towards the middle. You're only dealing with the error term for half the length of the line that way. Also, you might want to look up the Symmetric Double Step algorithm for something different.
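For illustration (not code from the original post), a minimal sketch of the symmetric idea, restricted to the left-to-right, 0 <= slope <= 1 octant and assuming a PlotPixel(x, y, color) helper:

extern void PlotPixel(int x, int y, unsigned int color);   /* assumed helper */

/* Symmetric Bresenham sketch: two cursors march toward the middle,
   sharing one error term, so the loop runs roughly dx/2 times.
   Note: when the pixel count is odd, the middle column is written
   from both ends; if the ideal line passes exactly halfway between
   two rows there, both candidate pixels get set. */
void SymmetricLine(int x0, int y0, int x1, int y1, unsigned int color)
{
    int dx = x1 - x0;          /* assumed dx >= dy >= 0 */
    int dy = y1 - y0;
    int err = 2 * dy - dx;
    int xl = x0, yl = y0;      /* left cursor  */
    int xr = x1, yr = y1;      /* right cursor */

    while (xl <= xr)
    {
        PlotPixel(xl, yl, color);
        PlotPixel(xr, yr, color);
        if (err > 0) { yl++; yr--; err -= 2 * dx; }
        err += 2 * dy;
        xl++; xr--;
    }
}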

-Ironblayde
 Aeon Software

The following sentence is true.
The preceding sentence is false.

jho    122
Can you point me to a URL with optimized source code that works for lines of any slope? I did this a few years ago but I can't find my code anymore..
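No URL ever turned up in the thread; for reference, here is a minimal all-octant integer Bresenham sketch (added for illustration, not from the original posts), again assuming a PlotPixel(x, y, color) helper:

#include <stdlib.h>   /* abs() */

extern void PlotPixel(int x, int y, unsigned int color);   /* assumed helper */

/* Classic integer Bresenham that handles any slope and direction. */
void BresenhamLine(int x0, int y0, int x1, int y1, unsigned int color)
{
    int dx =  abs(x1 - x0), sx = (x0 < x1) ? 1 : -1;
    int dy = -abs(y1 - y0), sy = (y0 < y1) ? 1 : -1;
    int err = dx + dy;                       /* = dx - |dy| */

    for (;;)
    {
        PlotPixel(x0, y0, color);
        if (x0 == x1 && y0 == y1)
            break;
        int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; }   /* step along x */
        if (e2 <= dx) { err += dx; y0 += sy; }   /* step along y */
    }
}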

neocron    122

It would be interesting to compare a DDA line algorithm to Bresenham's on today's processors. With DDA you don't have to do that extra comparison in the loop, so you can remove a potential pipeline stall. Fixed point or float would be another optimization option. Also, you can unroll the DDA by accumulating n variables and n-stepping them (keeping them independent of each other), again to keep the pipeline full.

Additionally, I think DDA is a little easier to implement.

What do you think?

baskuenen    124
I must come back to my previous post.

I think neocron is right when he says Bresenham might not be the fastest anymore, because of how costly the compare can be nowadays.

Since there's a limit to the width and height of your screen, it's easy to make lookup tables for the divides needed to pre-calculate your interpolation values.
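For illustration (not from the original post), one way such a table might look, assuming the longest possible delta is 1024 pixels and 16.16 fixed-point increments:

/* Hypothetical reciprocal table: RecipTable[n] ~ 65536 / n, so the
   per-line divide dy/dx becomes a table load plus one multiply. */
#define MAX_DELTA 1024

static long RecipTable[MAX_DELTA + 1];

void InitRecipTable(void)
{
    RecipTable[0] = 0;                            /* unused entry */
    for (int n = 1; n <= MAX_DELTA; n++)
        RecipTable[n] = (65536L + n - 1) / n;     /* ceil(65536/n), 16.16 */
}

/* e.g. for an x-major line: yinc = dy * RecipTable[dx];   (16.16 result) */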

jho: What's your programming language, and what do you need it for? How fast must it be, and are you prepared to go into assembly?

jho    122
baskuenen: I want source code in C/C++, but I can read Pascal just fine. No ASM, because it is for a multi-platform project.

It's not for a homework assignment, if you're worried about that! I'm just trying to get rid of my GDI calls right now and use my own pixel manipulation to compare performance.


SiCrane    11839
The DDA algorithm still tends to be slower because of the large clock delay in the execution of FMUL instructions. (And it'll probably be even worse when the P4s come out.) The all-integer instruction mix of a Bresenham implementation, on the other hand, pipelines very nicely, with low cycle delay. Also, branch prediction and speculative execution pretty much eliminate the penalty of the compare vs. any other integer instruction. However, these same developments have very much reduced the benefits of the symmetric and multi-step line drawing algorithms, as the memory-motion costs begin to dominate over actual execution costs, especially with the reduced register set of an x86 processor. Also, you really do not want to do lookup tables for line drawing. The cache-miss cost of the lookup table colliding with the actual surface buffer would not be worth the few (if any) processor cycles saved.

For some hard numbers, the following simulation was run:
On a PII-300 running a virtual 640x480x32 screen (i.e. a segment of memory 640x480x4 bytes large, not actually hooked to a screen buffer of any sort), several thousand random lines were generated with DDA, Bresenham and Symmetric Two Step. The seed was saved so that the same lines were drawn for each algorithm. The test was run three times and the times for each algorithm were averaged. The DDA algorithm took an average of 163 ms per trial, the Bresenham algorithm took 63 ms, and Symmetric Two Step took 70 ms. (The resolution of the timer was only approximately 10 ms.) Generating the random numbers with a call to a stub function took an average of 13 ms.

Ironblayde    130
That's some interesting stuff, SiCrane; I always wonder about things like that. I'm familiar with caching, branch prediction, pipelining, etc., but I learned assembly on the MIPS architecture, so I don't know how any of it is implemented for Intel processors. Are there any books you recommend for learning that sort of thing?

-Ironblayde
 Aeon Software

The following sentence is true.
The preceding sentence is false.

neocron    122

Hi all,

Here's a snippet of code from my DDA line inner loop. 'xinc' is in 16.16 fixed point; 'yinc' is in 16.16 fixed point and includes the width of the screen (so no muls). Of course you have to determine the x-major or y-major orientation, and then dy/dx (or dx/dy), before using this loop (just like Bresenham).

    
/* pDst points at the start of the surface (one element per pixel);
   xaccum/yaccum are 16.16 accumulators, with yinc pre-multiplied by
   the screen width as described above. */
for( int ix=0; ix<nLineLen; ix++ )
{
    // Set the (x,y) point on the screen to color
    *(pDst + (yaccum>>16) + (xaccum>>16)) = color;
    xaccum += xinc;
    yaccum += yinc;
}


You can see that you can add an xaccum2 and yaccum2, and multiply xinc and yinc by two, to get a better pipeline fill (as well as effectively unrolling the loop - remember to increment ix by two and account for odd line lengths).

You could use floats instead (so you don't have to do the two right shifts), but remember to use an optimized cast and NOT just (int) or floor() to get the integer value of the screen (x,y) coordinate.

Additionally, you could MIX float and int to get more parallelism and hopefully more speed. If anyone has an example of this I'd love to see it.

Hope this helps,
Karen
neocron    122

Example of optimization (sorry to post twice, but I can't get the post to work with two [ source ] statements):

    
/* Same inner loop, unrolled once: two independent accumulator pairs,
   so the two writes can overlap in the pipeline. This assumes xinc
   and yinc were doubled during setup (see the previous post) and that
   xaccum2/yaccum2 start one original step ahead of xaccum/yaccum.
   nLineLengthDiv2 is the line length divided by two. */
do
{
    *(pDst + (yaccum>>16) + (xaccum>>16)) = color;
    xaccum += xinc;
    yaccum += yinc;
    *(pDst + (yaccum2>>16) + (xaccum2>>16)) = color;
    xaccum2 += xinc;
    yaccum2 += yinc;
}
while( --nLineLengthDiv2 );


...or something like this to fill out the pipeline. There will only be one missed branch prediction per line drawn using this method. It also assumes every drawn line is at least a dot.

Branch prediction in Bresenham's will still (on average, with random line orientations) get it wrong 25% of the time (worst case just under 50%, best case just over 0%).

Ironblayde    130
quote:
Original post by neocron

You could use floats instead (so you don't have to do the two right shifts), but remember to use an optimized cast and NOT just (int) or floor() to get the integer value of the screen (x,y) coordinate.



'Optimized cast'?

-Ironblayde
 Aeon Software

The following sentence is true.
The preceding sentence is false.

neocron    122

Optimized cast: just a faster way to get, for example, an int from a float. Here's some code:

    
/* Converts a float to an int using FISTP (MSVC inline assembly).
   Note: FISTP uses the FPU's current rounding mode (round-to-nearest
   by default), whereas a plain (int) cast truncates toward zero -
   which is exactly why the cast is slow: the compiler has to switch
   the rounding mode around it. */
inline int FloatToInt( float a )
{
    int b;
    _asm
    {
        fld a;
        fistp b;
    }
    return b;
}

Thrump    169
Only read the first few posts in detail, but what about getting rid of some comparisons by using run lengths of pixels in the middle?
For instance, if there are runs of 3 and 4 pixels in the middle of the line (middle being everywhere but the end points), then always draw 3 pixels and only test the last one, until you reach the end. (The start and end could have a run length different than 3 or 4 pixels in this case... it just depends.)

baskuenen    124
quote:
Original post by SiCrane

Also, you really do not want to do lookup tables for line drawing. The cache-miss cost of the lookup table colliding with the actual surface buffer would not be worth the few (if any) processor cycles saved.



Isn't it true that the cache used for the relatively small lookup table (screen width 1024 * sizeof(int) = 4096 bytes) has only a small impact on the total amount of cache today?
Since the lookup table is used for every line, it should be in the cache after the first few lines. I don't see any cache misses after that...

What am I missing here? Did I fail my cache course?

SiCrane    11839
Ironblayde: If you already know the basics of how a processor works (i.e. stalls, forwarding, pipelining, etc.), basically just read the technical specifications and commentaries.

baskuenen: Maybe I'm not understanding your lookup table, but with only 1024 entries, it seems that you can only store the reciprocals of the run length, which means you still have to do a multiply instruction in order to get the increment value. So the combination of the load with the multiply would seem to be worse compared to a straight divide, even if there is no cache miss. As I originally understood it, it sounded like you wanted a lookup table the size of the screen (1024x768), which obviously would have major cache problems.

But let's assume that I'm just misunderstanding your lookup table and it works efficiently with regard to best-case instruction time. Take a PIII Katmai: it has a 16 KB, 4-way set associative L1 data cache. If you load in a 4 KB lookup table, you've taken up a cache line in each and every set. If we have a 32-bit framebuffer at 1024x768, then each horizontal scanline will similarly consume a cache line in each and every set in the cache. Because of the structure of a frame buffer, we know that all the pixels along a vertical line fall in the same cache set. Now let's draw a vertical line straight from the top of the screen to the bottom. We've just loaded 768 different cache lines in and out of one set in the L1 cache. Not only that, but the L2 cache on a PIII Katmai is only 256 KB and 8-way associative, so whatever was flushed from the L1 cache is no longer sitting in the L2 cache either. So when you try to use any part of the lookup table that sat in that set, you'll have to do a fetch from main memory, which takes about 115 clock cycles. Not only will purely vertical lines do this, but so will many near-vertical lines, as horizontally adjacent pixels also tend to fall in the same cache line. Many lines with a slope somewhere between vertical and 1 will also cause entire cache sets to be flushed, which induces the less costly L1 cache miss on table lookup, still on the order of 10 additional cycles. And after three or four near-horizontal line calls, it seems likely that previously unused elements of the lookup table will also be flushed from the L1 cache. So it seems likely that for random lines, you'll almost always have to load the lookup table value from the L2 cache or worse.

Also, things are more grim if you consider the P4, which only has an 8 KB L1 cache, or the PII, where the cache-miss penalties are much higher.

baskuenen    124
quote:
Original post by SiCrane

baskuenen: Maybe I'm not understanding your lookup table, but with only 1024 entries, it seems that you can only store the reciprocals of the run length, which means you still have to do a multiply instruction in order to get the increment value. So the combination of the load with the multiply would seem to be worse compared to a straight divide, even if there is no cache miss. As I originally understood it, it sounded like you wanted a lookup table the size of the screen (1024x768), which obviously would have major cache problems.

But let's assume that I'm just misunderstanding your lookup table and it works efficiently with regard to best-case instruction time. Take a PIII Katmai: it has a 16 KB, 4-way set associative L1 data cache. If you load in a 4 KB lookup table, you've taken up a cache line in each and every set. If we have a 32-bit framebuffer at 1024x768, then each horizontal scanline will similarly consume a cache line in each and every set in the cache. Because of the structure of a frame buffer, we know that all the pixels along a vertical line fall in the same cache set. Now let's draw a vertical line straight from the top of the screen to the bottom. We've just loaded 768 different cache lines in and out of one set in the L1 cache. Not only that, but the L2 cache on a PIII Katmai is only 256 KB and 8-way associative, so whatever was flushed from the L1 cache is no longer sitting in the L2 cache either. So when you try to use any part of the lookup table that sat in that set, you'll have to do a fetch from main memory, which takes about 115 clock cycles. Not only will purely vertical lines do this, but so will many near-vertical lines, as horizontally adjacent pixels also tend to fall in the same cache line. Many lines with a slope somewhere between vertical and 1 will also cause entire cache sets to be flushed, which induces the less costly L1 cache miss on table lookup, still on the order of 10 additional cycles. And after three or four near-horizontal line calls, it seems likely that previously unused elements of the lookup table will also be flushed from the L1 cache. So it seems likely that for random lines, you'll almost always have to load the lookup table value from the L2 cache or worse.

Also, things are more grim if you consider the P4, which only has an 8 KB L1 cache, or the PII, where the cache-miss penalties are much higher.


SiCrane, thanks for your explanation. I still cannot agree with you.

First, I'm assuming we're reading the pixel before writing - otherwise it won't be loaded into the cache.
Second, for a maximum-length vertical line, you have 768 pixels. Every cache line is 32 bytes; 32*768 = 24K of cache needed. You're right when we have a small cache. (I assumed we had at least 128K, like the "old" Pentiums.) Isn't the cache nowadays getting smarter, like overwriting the least-used memory? I'm not sure about this though.
The third thing - using a 1024x768 lookup table is overdoing it a bit. I would suggest a lookup table containing the values of 1/DeltaWidth or 1/DeltaHeight. And yes - we would need an extra multiply to put this value to use (dx/dy or dy/dx).

Just some code - not really important
    
FLD   LookupTbl[eax*4]   ; eax = dx                     ; 1 cycle
FMUL  [dy]                                              ; 1 cycle
; do some integer work here                             ; 1 cycle lost to FISTP
FISTP [result]                                          ; 3 cycles (or was it 5?)
MOV   eax, [result]      ; paired instruction           ; ~0.5 cycle to load it back into a register

; total of about 6.5 CPU cycles for the calculation



If the cache is only 8K in size, using it for drawing (including read-before-write) would make it really slow.
If we don't use the cache for writing to the screen, the bit lost to our lookup table is acceptable.

Am I wrong?

Is it true that the size of the cache is decreasing in newer computers? Why? To make a small bit faster, or for cost?

Some last words:
One thing is for sure: if your lines are very short, Bresenham is your solution!

SiCrane    11839
The L1 cache has always been minuscule. The L2 cache is what you are thinking of when you say the old Pentiums had 128 KB. However, what's really relevant is the cache organization. The L1 cache on a PIII Katmai core is 4-way set associative, which means that any given memory location is limited to one of four cache lines. And while cache lines are flushed in an approximation of the least recently used algorithm, if a vertical line is drawn, it will try to load all 768 pixels into the same 4 cache lines. One of those lines may be preserved by stack accesses in the best case, but the cache line that referenced the lookup table will most definitely be flushed from the L1 cache - even if that cache line held the currently used denominator, because the value actually used is located on the stack, not read directly from the lookup table. The L2 cache on the Katmai is 8-way associative, and again the pixels along the vertical line share the same mapping, so all 768 will try to be loaded into the same 8 cache lines. Again, one line will probably be saved from being flushed by the stack, but the line that held the lookup table will be flushed as 768 cache lines compete for the remaining ways. If either cache were fully associative, which it seems like you're assuming, then we wouldn't have this problem, but fully associative caches are *slow*. (Incidentally, this associativity problem is why the last batch of Celerons perform so poorly: they have half the L2 cache of the equivalent PIIIs, and it's only 2-way associative.)

And it's not like you can stop the cache from being loaded on a write to the framebuffer. While the PIII core uses a nice write-back scheme, which gives write instructions essentially no overhead (an oversimplification), the cache is still loaded with the written line for coherence. (I'm assuming the Pentium 4 will be similar, but it's not like I've gotten the chance to play with one yet.)

And cache sizes on the whole have been going up over the years, as the price and performance of transistors have improved. However, the Pentium 4 engineers decided that they needed to decrease latency on the L1 cache, so the L1 data cache on the P4 is being dropped to 8 KB, which drops the access time from 3 cycles to 2 cycles. Considering how much time x86 programs spend moving data to and from local variables on the stack, this isn't necessarily a bad thing. However, AMD's engineers didn't share that opinion, so the T-Bird has a nice happy 64 KB L1 cache.

baskuenen    124
Ok - you got me here. This is going a little over my head, and that doesn't happen often.

I stopped coding in asm around the time the PII arrived, and my knowledge of caches goes about that far. I had a class a year ago called "Computer Architecture" that was about the stuff you are talking about. I don't remember any of it anymore. Maybe I should have paid more attention?

Anyway, SiCrane: where did you get that knowledge? Is there a nice practical article somewhere where I can upgrade my knowledge about this? I don't see myself doing assembly anymore, but it helps me make good C decisions. I don't find it interesting enough to read loads on it, but a nice small practical article would be fun.

Who are you, SiCrane? It doesn't happen that often that people know more than me! (I hope.) You've got my respect here.

(to jho: So now you know how to draw a line! NOT! Sorry about this.)



Edited by - BasKuenen on November 2, 2000 6:21:15 PM

Vetinari    133
Various line drawing algorithms are described in varying levels of detail in Abrash's Black Book of Graphics Programming. It's a good book, but a lot of it is becoming outdated.

As for cache information, there used to be some good stuff at arstechnica.com, but it seems to have gone. There are numerous books written on caches, though. Arstechnica has long been my favorite source of hardware overviews.


Mike

Vetinari    133
You may want to check out Bresenham's run-slice algorithm, which draws lines in runs as opposed to a pixel at a time. The idea is that for any given line, a run along the major axis differs by at most one pixel from the next, so you can draw a full run with no tests at all and only one extra test per run. There's plenty of room for special-case optimizations too. Abrash claims this is faster in his book, but I've never tested it. Have you tested this one, SiCrane?
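For illustration (not from the original post, and not Abrash's exact code), a minimal sketch of the run idea, restricted to the x-major, both-deltas-positive octant and assuming a DrawHorizontalRun(xFirst, xLast, y, color) helper. For clarity it spends one divide per run; Abrash's version replaces those divides with an incremental error term.

extern void DrawHorizontalRun(int xFirst, int xLast, int y, unsigned int color);   /* assumed helper */

/* Draws the line as one horizontal run per row instead of deciding
   pixel by pixel. Assumes dx >= dy > 0 and both deltas positive.
   A column that lands exactly halfway between two rows is assigned
   to the higher row. */
void RunSliceLine(int x0, int y0, int x1, int y1, unsigned int color)
{
    int dx = x1 - x0;
    int dy = y1 - y0;
    int prevStart = x0;

    for (int k = 1; k <= dy; k++)
    {
        /* first column whose nearest row is y0 + k:
           ceil((2k - 1) * dx / (2 * dy)) */
        int start = x0 + ((2 * k - 1) * dx + 2 * dy - 1) / (2 * dy);
        DrawHorizontalRun(prevStart, start - 1, y0 + k - 1, color);
        prevStart = start;
    }
    DrawHorizontalRun(prevStart, x1, y0 + dy, color);   /* final run */
}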


Mike

bogdanontanu    122
Hi,
there is no doubt Bresenham's algo is the fastest ever...
DDA is much slower... about 100% slower, because of the real numbers.

Eh, there is another one faster than Bresenham, but much more complicated: Pitteway-Castle.

Anyway SiCrane, you got my respect also.

You know much better than me... on cache
(and I am supposed to be God )

I make all my programs in ASM... I'm doing an RTS game in full asm... and still you know better on cache... OMG...

My respects once again,
Bogdan

LilBudyWizer    491
I'll just assume SiCrane is a hardware guy, because I don't see how you would pick that up from programming alone. He certainly has my respect either way, but neocron is my god today.

neocron,

Thank you very much. I've been trying to optimize a routine that produces a greyscale heightmap for a terrain editor all day. 0.032 seconds was as fast as I could get it for a 510x510x32 bitmap in the worst case. Although that's fast enough for what it's doing, it seemed pretty poor. Everything outside the inner loop took 0.0013 seconds, and there were only five statements in the inner loop. I spent four hours getting it from 0.038 down to 0.032. I had one floating-point add and a cast from floating point to integer, and I could not for the life of me figure out how to get rid of them - then I saw your mention of fixed point. I spent two years doing nothing but fixed-point assembler, and yet I forgot. That took it down to 0.0088.

Perhaps not the best in the world, but that is moving 118 MB/s. I only got 180 MB/s filling a 4 MB buffer with a fixed value, doing nothing but checking for the end of the buffer and moving the value, so it's good enough for me.
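For illustration (not from the original post, and with hypothetical names), a minimal sketch of the fixed-point trick being described: the float add plus float-to-int cast in the inner loop is replaced by a 16.16 integer accumulator.

/* Fills one scanline of a 32-bit greyscale bitmap. 'shade' and 'step'
   are 16.16 fixed point; the integer part of the accumulator is the
   grey level, so the loop needs no float add and no float-to-int cast. */
void FillGreyRow(unsigned int *row, int width, long shade, long step)
{
    for (int x = 0; x < width; x++)
    {
        unsigned int g = (unsigned int)(shade >> 16) & 0xFF;
        row[x] = g | (g << 8) | (g << 16);   /* grey in R, G and B */
        shade += step;
    }
}

The only float-to-int conversions left are the ones that set up shade and step (value * 65536), once per row, outside the inner loop.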

SiCrane    11839
The irony here, of course, is that I'm not a hardware guy. I've always considered myself a hardcore applications-side person. I almost flunked my required computer architecture classes in college; however, I did take away enough knowledge to know which parts affect performance. (I just didn't care what algorithm was used for branch prediction, and if I want to know how many bits are in a TLB entry, I'll look it up on the spec sheets.)

Vetinari: I never benchmarked the run-slice algorithm; it doesn't feather well, and I need that additional information much too often to really consider it.
